

FAIL  SAFE:  Fault  Aware  IntelLigent  SoftwAre  For  Exascale 

Final  Report 

June  13,  2016 


Award  number  W911NF-13-1-0219 


Robert  F.  Lucas 
Information  Sciences  Institute 
University  of  Southern  California 
(310)448-9449 
rflucas@isi.edu 


Introduction 

By  the  end  of  this  decade,  the  Department  of  Defense  will  be  deploying  large  numbers  of  massively 
parallel  systems  to  address  a  broad  set  of  problems  ranging  from  mission  critical  challenges  such  as 
cryptanalysis,  image  processing,  decision  support,  and  weather  forecasting  to  fundamental  research 
questions in science and technology. These high-performance computing systems will be constructed
from exascale technology. As such, they will be composed of devices less reliable than those used today,
and  faults  will  become  the  norm,  not  the  exception.  This  will  pose  significant  problems  for  Defense 
users,  who  for  half  a  century  have  enjoyed  an  execution  model  that  largely  relied  on  correct  behavior  by 
the  underlying  computing  system. 

The  University  of  Southern  California  (USC),  the  Lawrence  Livermore  National  Laboratory  (LLNL),  and  the 
Jet  Propulsion  Laboratory  (JPL)  believe  that  a  new  generation  of  dependable  applications  must  be 
developed  to  successfully  exploit  this  next  generation  of  technology.  Such  applications  and  the  systems 
they  run  on  must  be  introspective  and  adaptive,  actively  searching  for  errors  in  their  program  state  with 
hardware  mechanisms  and  new  software  techniques.  Towards  this  end,  the  Army  Research  Office  (ARO) 
funded  us  via  contract  number  W911NF-13-1-0219  to  perform  research  with  the  goal  of  developing  and 
demonstrating  the  technology  to  enable  adaptive,  application-oriented  control  of  fault  tolerance.  Our 
initial  plan  was  to  extend  the  ROSE  compiler,  the  LLVM  compiler,  and  the  SHINE  introspection  engine  so 
that  faults  injected  into  a  resilient  application's  state  will  be  detected  and  dealt  with,  whether  by 
ignoring  them,  correcting  them  if  possible,  or  reverting  to  an  earlier  checkpoint  when  necessary.  A 
reduction  in  scope  from  the  proposal  prevented  us  from  pursuing  the  work  with  LLVM. 

The  report  provides  an  overview  of  the  results  of  this  research.  Further  details  can  be  found  in  the 
publications  listed  below,  the  PhD  thesis  of  Dr.  Saurabh  Hukerikar,  and  software  that  was  delivered  to 
NSA's  R3.  The  remainder  of  this  report  is  organized  along  the  lines  requested  by  ARO  for  previous 
reports. 

Objective 

In  the  FAIL-SAFE  project,  we  extended  the  familiar  C  programming  language  to  allow  software 
developers to express their knowledge of the fault tolerance of their applications. The approach
pursued uses a high-level annotation language for expressing user knowledge about runtime
correctness conditions, error tolerance, and specific methods for enhancing reliability. Because it is
user-controlled, this approach avoids the performance and/or energy penalties associated
with the blind application of traditional redundancy methods such as hardware/software
Triple-Modular Redundancy (TMR). We extended our initial resilience framework along with the ROSE
compiler  and  the  SHINE  introspection  engine  so  that  faults  injected  into  a  resilient  application's 
state  can  be  detected  and  dealt  with,  whether  by  ignoring  them,  correcting  them  if  possible,  or 
reverting  to  an  earlier  checkpoint  when  necessary.  The  results  of  this  research  provide  a  model  for 
the vendors of Defense systems, and a prototype capability should the vendors choose not to bring
such  technology  to  market.  The  increased  application  resilience  resulting  from  this  research  will 
lead  to  faster  completion  of  Defense  applications,  and  thus  substantial  energy  savings  as  well  as 
increased  mission  assurance. 

Approach 

At the beginning of the FAIL-SAFE project, USC had an existing fault tolerance test bed, constructed
with prior research funding, which demonstrated that some uncorrectable errors can be ignored while
applications still run on to correct solutions. The FAIL-SAFE project created a major
generalization of this system by (1) developing a high-level annotation language; (2) automating
the translation of directives in this high-level annotation language as source-to-source code
transformations in the DoE-funded and supported ROSE compiler infrastructure; (3) designing an
interface between the application and an introspection framework for resilience (IFR) based on the
inference engine SHINE; (4) using the ROSE compiler to translate annotations into reasoning rules
for the IFR; and (5) designing a Knowledge/Experience Database, which will store knowledge about
dynamic program behavior and resilience determined by the IFR to be leveraged in subsequent
program development cycles. We implemented this technology as open-source software so that
Defense users have access to it.

Scientific  Barriers 

Prior to FAIL-SAFE, we had already demonstrated that a knowledgeable user can assert which
regions of a program's state space can tolerate errors, and that these programs can still run on to
correct solutions. To broaden the impact of this research, we also needed to be able to ameliorate
errors with minimal overhead. Today, this is largely done by checkpointing state, rolling
back when errors are detected, and restarting the code. We demonstrated that one can extend a
standard programming API to allow a knowledgeable user to provide alternative repair
strategies that will reduce the frequency of checkpointing and restarting, thus saving time and
energy. What these strategies are, and how broadly applicable they will be, remains an open research
question. 

Significance 

Today’s  standard  model  of  computation,  embodied  in  familiar  programming  languages,  assumes 
that the underlying computer runs correctly. This model is generally accepted, except in
safety-critical systems like flight controls. In the near future, this assumption will no longer hold due to the
continued scaling of VLSI. In addition, it will be prohibitively expensive to enforce total operational
correctness with error correction using redundancy. The results of the research carried out in this
project will be software that exploits human knowledge of which faults are significant, and which are
not, to reduce the overhead of maintaining the illusion of perfect computing systems. This will save
time  and  energy  for  large-scale  Defense  computations. 

Accomplishments


The FAIL-SAFE project was initially funded in August of 2013. Because we partnered with national
laboratories (LLNL and JPL), it took months to get them on contract. In fact, due to inconsistencies
between  the  FAR  and  the  Space  Act,  ARO  had  to  intervene  and  waive  some  clauses.  So,  in  a  very  real 
sense  the  FAIL-SAFE  team  only  came  together  in  early  2014. 

As  first  reported  in  August  2014,  the  FAIL-SAFE  team  drafted  an  assertion  language  that  allows 
users  to  communicate  fault  tolerance  knowledge  to  a  compiler  and  underlying  computing  systems. 
LLNL extended ROSE to parse our initial assertion language. The assertion language was itself a
work in progress, and its refinement continued throughout the remainder of the project. The
most recent version, Version 8.0, was distributed in August of 2015, after we began collaborating
with Dr. Erik DeBenedictis from Sandia National Laboratories (SNL) at the direction of the
government. It provided assertion language support for key concepts outlined in Dr. DeBenedictis's
white paper entitled "Managing the End of Moore's Law". Version 8.0 is included as an appendix to
this report. JPL also delivered to USC a first prototype of a SHINE-generated Introspection Framework for
Resilience (IFR). This has been tested in the USC resilience test bed, running on USC's HPC cluster,
and was demonstrated at the July 2014 DOD ACS program workshop. Finally, LLNL implemented
aspects of the assertion language in ROSE, software that has been delivered to NSA R3. These
accomplishments  are  discussed  in  more  detail  below. 


[Figure 1 panels: Compilation; Pre-Execution; Application Execution]

Figure 1. Application compilation and fault-injection infrastructure.

Accomplishments:  Fault  Detection  and  Amelioration 

USC  developed  a  fault-injection  and  amelioration  analysis  infrastructure,  depicted  in  Figure  1  above. 
Using  this  infrastructure,  we  developed  a  set  of  computational  kernels  to  be  analyzed  for 
robustness  and  refined  the  type  of  failures  observed.  Figure  2  below  depicts  the  impact  of  faults 
randomly  injected  into  two  computational  kernels.  The  injection  rate  varies  from  once  every  fifteen 
minutes,  to  once  per  minute.  The  outcomes  are:  faults  are  detected  and  corrected;  benign  faults 
where  the  application  succeeds  even  though  faults  are  silent;  undetected  faults  leading  to  incorrect 
computational  outcomes;  and  application  crashes.  The  Graph  500  breadth-first-search  algorithm 
contains  several  pointer-related  computations  to  traverse  the  graph  edges.  Therefore,  almost  50% 
of  the  execution  runs  can  detect  and  correct  the  corruptions  in  the  pointer  arithmetic.  However, 
since the other parts of the computational environment as well as the graph vertex data elements
contain no error management knowledge, the application fails more often at very high fault rates.
The  AMG  code  is  naturally  resilient  to  errors.  Most  of  its  memory  is  allocated  to  the  intermediate 
solution  grids  at  each  level  in  the  V-cycle,  and  as  an  iterative  algorithm,  it  can  often  reach  a  solution 
in  spite  of  errors.  The  pointer  arithmetic  is  the  most  sensitive  to  silent  corruptions  and  the  robust 
qualifiers  create  redundant  copies  which  allow  detection  and  correction  for  errors  injected  in  these 
variables. 
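
The redundant-copy idea can be illustrated with a minimal sketch in C. It assumes three copies of a single index value and majority voting on every read; the names robust_index_t, robust_write, and robust_read are hypothetical and are not taken from the project's robust-qualifier implementation.

```c
#include <stddef.h>

/* Hypothetical illustration: three redundant copies of one index value. */
typedef struct {
    size_t copy[3];
} robust_index_t;

/* Store the same value in all three copies. */
void robust_write(robust_index_t *r, size_t value)
{
    r->copy[0] = r->copy[1] = r->copy[2] = value;
}

/*
 * Read with majority voting.
 * Returns 0 if all copies agree, 1 if one corrupted copy was detected and
 * repaired, and -1 if all three copies disagree (detected, not correctable).
 */
int robust_read(robust_index_t *r, size_t *out)
{
    size_t a = r->copy[0], b = r->copy[1], c = r->copy[2];

    if (a == b && b == c) { *out = a; return 0; }
    if (a == b || a == c) { robust_write(r, a); *out = a; return 1; }
    if (b == c)           { robust_write(r, b); *out = b; return 1; }
    return -1;
}
```

A single bit flip in any one copy is outvoted by the other two and repaired on the next read. The cost is a threefold storage overhead for the protected variables only, which is why such protection would be reserved for sensitive state such as pointer arithmetic.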


[Figure 2 legend: Silent Data Corruption Detected/Corrected through RoLex Robust Extensions;
Benign Faults; Undetected Faults leading to Incorrect Outcome; Application Crash]

Figure 2. Evaluation of application kernel robustness for various fault-injection rates.


At the direction of the government, we also implemented simple amelioration strategies for two
Conjugate Gradient implementations (a traditional CG and a Self-Stabilizing CG) as well as an
algorithm-based fault-tolerance (ABFT) strategy for the popular matrix-matrix multiplication
DGEMM kernel. In the case of DGEMM, column- and row-wise checksums allow the
implementation to recover the correct matrix element value in the presence of an error. For the CG
kernels, the implemented amelioration procedures were, respectively, a roll-forward and a
roll-backwards to the previous iteration of the algorithm, using selected saved data.

Figure 3. Evaluation of simple amelioration strategies for DGEMM (ABFT) and Conjugate Gradient
(roll-forward and roll-backwards).

Figure  3  above  summarizes  the  results  of  these  experiments  for  various  fault  injection  rates.  For  the 
DGEMM  code,  the  checksum-based  amelioration  is  applicable  for  only  the  static  data,  i.e.  the 
operand  matrices  which  are  initialized  at  the  beginning  and  whose  values  do  not  change 
throughout  the  execution.  A  total  of  75%  of  all  executions  converge  correctly  for  a  rate  that  injects 
an  error  every  5  minutes  but  only  27%  complete  correctly  at  the  accelerated  rate  of  an  error  per 
minute.  For  the  CG  computation,  we  leverage  the  iterative  nature  of  the  algorithm  that  allows  the 
execution  to  often  overcome  errors  on  the  solution  vector  while  the  operand  matrices  are  protected 
using  checksums. 
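
The checksum-based recovery for static operand matrices can be sketched as follows. This is a minimal illustration, not the project's DGEMM code: it assumes a small fixed dimension N, a single corrupted element, and a caller-supplied tolerance, and the function names abft_encode and abft_check_and_repair are hypothetical.

```c
#include <stddef.h>

#define N 4  /* illustrative matrix dimension (assumption) */

/* Record row and column sums of a matrix that stays constant afterwards. */
void abft_encode(double a[N][N], double row_sum[N], double col_sum[N])
{
    for (size_t i = 0; i < N; i++) { row_sum[i] = 0.0; col_sum[i] = 0.0; }
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++) {
            row_sum[i] += a[i][j];
            col_sum[j] += a[i][j];
        }
}

/* True if x and y differ by more than tol; also true if either is NaN. */
static int differs(double x, double y, double tol)
{
    double d = x - y;
    return !(d <= tol && -d <= tol);
}

/*
 * Verify the matrix against its saved checksums.
 * Returns 0 if consistent, 1 if a single corrupted element was located and
 * repaired, and -1 if the mismatch pattern is not a single-element error.
 */
int abft_check_and_repair(double a[N][N], const double row_sum[N],
                          const double col_sum[N], double tol)
{
    int bad_row = -1, bad_col = -1, n_rows = 0, n_cols = 0;

    for (int i = 0; i < N; i++) {
        double r = 0.0, c = 0.0;
        for (int j = 0; j < N; j++) { r += a[i][j]; c += a[j][i]; }
        if (differs(r, row_sum[i], tol)) { bad_row = i; n_rows++; }
        if (differs(c, col_sum[i], tol)) { bad_col = i; n_cols++; }
    }
    if (n_rows == 0 && n_cols == 0) return 0;
    if (n_rows != 1 || n_cols != 1) return -1;

    /* Rebuild the element from its row checksum and the other row entries. */
    double others = 0.0;
    for (int j = 0; j < N; j++)
        if (j != bad_col) others += a[bad_row][j];
    a[bad_row][bad_col] = row_sum[bad_row] - others;
    return 1;
}
```

The intersection of the one failing row and the one failing column locates the corrupted element, which is then rebuilt from the row checksum. The scheme only applies to data whose checksums remain valid, which matches the restriction to static operand matrices noted above.
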

Many faults are only detectable by the appearance of erroneous data in program state. N-modular
redundancy is the traditional mechanism for detecting these, and its overhead exceeds a factor of N
because synchronization and comparisons must be added to implement it, not just multiple copies of the
computation. The FAIL-SAFE project explored an adaptive redundant multithreading variation of
this well-known technique. The compiler replicates code blocks identified by the programmer that
can serve as spheres of replication, where values can be checked before errors propagate beyond
the boundaries. An introspective runtime system can enable or disable redundancy for any of these
code spheres based on the fault rates it is observing. As shown below in Figure 4, this reduces the
overhead  of  using  redundancy  to  test  for  the  presence  of  errors,  relative  to  process  pairing. 
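
The control logic of an adaptive sphere of replication can be sketched in C. This sketch makes several simplifying assumptions: the replicas run sequentially rather than in separate threads, the replicated block is a pure function of its input, and the runtime is handed an externally observed fault rate. All names (sphere_t, sphere_run, sphere_adapt, and the two example blocks) are hypothetical.

```c
#include <stdbool.h>

/* A replicable code block: assumed to be a pure function of its input. */
typedef long (*block_fn)(long input);

/* Hypothetical per-sphere state kept by the introspective runtime. */
typedef struct {
    bool redundant;            /* is replication currently enabled? */
    unsigned long mismatches;  /* replica disagreements seen so far */
} sphere_t;

/*
 * Execute one sphere of replication. With redundancy enabled, the block runs
 * twice and the results are compared before the value leaves the sphere.
 * Returns true if the result was accepted, false if the replicas disagreed.
 */
bool sphere_run(sphere_t *s, block_fn f, long input, long *out)
{
    long first = f(input);
    if (s->redundant) {
        long second = f(input);
        if (first != second) { s->mismatches++; return false; }
    }
    *out = first;
    return true;
}

/* Policy sketch: replicate only while the observed fault rate is high. */
void sphere_adapt(sphere_t *s, double observed_fault_rate, double threshold)
{
    s->redundant = (observed_fault_rate >= threshold);
}

/* Example blocks: a correct one, and one emulating an intermittent fault. */
long square_block(long x) { return x * x; }
long flaky_block(long x)
{
    static int calls = 0;
    return x * x + (calls++ % 2);  /* every second call is off by one */
}
```

When sphere_run reports a disagreement, the caller would re-execute the block or fall back to a checkpoint. Disabling replication when the observed fault rate is low is what recovers most of the overhead in this scheme.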


[Figure 4 legend: Serial; Redundant Multithreading; Process Replication]

Figure 4. Adaptive redundant multithreading reduces overhead relative to full process replication.

Figure 5. Performance overhead of fault-aware thread assignment for modeling accuracy rates of 10%
and 90%. Note that this quickly plateaus once the faulty cores have been isolated.

Another  tool  that  an  introspective  system  could  have  at  its  disposal  is  the  ability  to  migrate  tasks 
away  from  faulty  cores.  Faults  in  systems  are  correlated  in  both  time  and  place.  For  example, 
studies  have  shown  that  5%  of  locations  account  for  95%  of  faults.  One  can  attempt  to  anticipate 
the  occurrence  of  future  faults  based  on  historical  data.  Figure  5  above  illustrates  the  overhead 
resulting from anticipating faults and migrating tasks. Note that the first ten or so errors injected
have a significant deleterious impact on performance, but that it quickly stabilizes as the fault rate
increases.  This  is  because  the  higher  fault  rates  provide  more  data  with  which  better  predictions 
can  be  made,  and  faulty  cores  isolated. 
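
A minimal sketch of history-based task placement is given below. It assumes a simple per-core fault counter and a fixed isolation budget; the project's actual prediction model is not described here, and the names and constants (NUM_CORES, ISOLATE_AFTER, fault_history_t, assign_core) are hypothetical.

```c
#include <stddef.h>

#define NUM_CORES 8      /* illustrative core count (assumption)    */
#define ISOLATE_AFTER 3  /* hypothetical fault budget per core      */

/* Fault history: number of faults attributed to each core so far. */
typedef struct {
    unsigned faults[NUM_CORES];
} fault_history_t;

void record_fault(fault_history_t *h, size_t core)
{
    if (core < NUM_CORES) h->faults[core]++;
}

/* A core is isolated once its fault count reaches the budget. */
int core_is_isolated(const fault_history_t *h, size_t core)
{
    return h->faults[core] >= ISOLATE_AFTER;
}

/*
 * Pick a core for the next task: start at the preferred core and move on
 * round-robin past isolated cores. Returns -1 if every core is isolated.
 */
int assign_core(const fault_history_t *h, size_t preferred)
{
    for (size_t k = 0; k < NUM_CORES; k++) {
        size_t core = (preferred + k) % NUM_CORES;
        if (!core_is_isolated(h, core)) return (int)core;
    }
    return -1;
}
```

Because faults cluster on a small number of locations, even a simple counter of this kind quickly steers work away from the cores responsible for most errors, which is consistent with the plateau described above.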

Accomplishments:  ROSE  Implementation  of  the  FAIL-SAFE  Assertion  Language 

As  part  of  the  FAIL-SAFE  project,  LLNL  defined  a  set  of  language  annotations  in  the  form  of  C/C++ 
pragmas  to  meet  the  requirements  defined  by  the  draft  FAIL-SAFE  Assertion  Language.  These  were 
realized  using  LLNL's  ROSE  compiler  transformation  front-end,  a  tool  whose  ongoing  development 
at  LLNL  made  them  a  critical  member  of  the  FAIL-SAFE  team.  The  FAIL-SAFE  assertion  language 
gives  a  set  of  abstract  syntax  and  grammar  rules  to  represent  predicates  and  directives.  Predicates 
are  further  categorized  into  status  predicates  and  data  predicates.  Status  predicates  are  associated 
with  a  point  of  execution  related  to  a  statement  and  specify  a  condition  that  needs  to  be  valid 
whenever  that  point  is  reached  during  program  execution.  Data  predicates  are  associated  with 
objects  generated  in  the  context  of  a  variable,  type,  or  class  declaration  and  specify  a  condition  that 
must  hold  for  these  data  throughout  their  lifetime.  Directives  include  mostly  tolerance  and 
redundancy directives. A tolerance directive expresses tolerance for certain classes of errors
occurring  during  the  execution  of  the  program,  such  as  arithmetic  or  SECDED  (single-error 
correction  double-error  detection)  errors.  Redundancy  management  directives  are  mainly  used  in 
high-reliability  sections  for  the  enforcement  of  hard  correctness.  They  provide  a  set  of  methods 
based  on  the  redundancy  of  code  or  data. 

A concrete set of language annotations for actual programming languages is needed to instantiate
the assertion language and enable programmers to use our assertions. Our design follows the
pragma convention of the OpenMP specification for easier understanding and compiler
implementation. Both C/C++ and Fortran programs can be handled uniformly using
OpenMP-style pragmas.

Each directive starts with #pragma failsafe. Directives are case-sensitive. The general form is

#pragma failsafe directive-name [clause [[,] clause] ...] new-line

A  FailSafe  executable  directive  applies  to  at  most  one  succeeding  statement,  which  must  be  a 
structured  block.  Some  representative  example  pragmas  we  define  are  depicted  below: 

// Status predicates
x = f0(y);
#pragma failsafe status assert (x < U) error (ET3) recover (R3,x,y,U)
L1: z = g(x);

// Another status predicate
LOOP1: while (cexpr)
{
    /* while-body */
    #pragma failsafe status assert (x < y) error (...)
}

// Data predicates
// form 1: context based
int ncycles;
#pragma failsafe data assert (0 <= ncycles <= maxcycles)

// form 2: using region reference in(R1)
int ncycles;
#pragma failsafe data assert (0 <= ncycles <= maxcycles) in (ncycles)

// form 3: using wild card keyword: allvars
int ncycles, mcycles;
#pragma failsafe data assert (0 <= allvars <= maxcycles)

Using  the  ROSE  compiler  infrastructure,  we  have  also  implemented  the  parsing  support  for  the  set 
of  C/C++  pragmas  we  defined,  including  those  specifying  assertion  region,  status  predicate,  data 
predicate,  violation  types  and  so  on.  The  implementation  is  based  on  the  extensions  to  ROSE  to 
parse  input  programs  annotated  with  our  pragmas  and  store  the  information  as  persistent 
attributes  attached  to  ROSE's  AST  (abstract  syntax  tree).  As  a  result,  we  now  have  a  proper  internal 
representation  of  programs  using  assertion  language  annotations. 
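
To make the first status predicate example concrete, the sketch below shows one way such an annotation could be lowered by hand into plain C. It is an illustration under stated assumptions, not the output of the ROSE transformation: the bodies of f0, g, and R3, the last_error variable, the numeric value of ET3, and the corrupt parameter used to emulate an injected fault are all hypothetical.

```c
int last_error = 0;   /* hypothetical slot where the error class is reported */
enum { ET3 = 3 };     /* hypothetical identifier for the declared error class */

int f0(int y) { return 2 * y; }  /* stand-in for the user's f0 */
int g(int x)  { return x + 1; }  /* stand-in for the user's g  */

/* Hypothetical recovery method R3: recompute x from y, then clamp below U. */
void R3(int *x, int y, int U)
{
    *x = f0(y);
    if (!(*x < U)) *x = U - 1;
}

/*
 * Hand-lowered form of:
 *     x = f0(y);
 *     #pragma failsafe status assert (x < U) error (ET3) recover (R3,x,y,U)
 *     L1: z = g(x);
 * The corrupt argument emulates a fault injected into x (0 means no fault).
 */
int run_fragment(int y, int U, int corrupt)
{
    int x = f0(y);
    x ^= corrupt;

    if (!(x < U)) {        /* status predicate checked on reaching L1 */
        last_error = ET3;  /* report the declared error class          */
        R3(&x, y, U);      /* invoke the declared recovery method      */
    }
    return g(x);           /* L1: z = g(x) */
}
```

A data predicate such as (0 <= ncycles <= maxcycles) would need a different treatment, since in C that chained comparison does not express a range check. A lowering would expand it to 0 <= ncycles && ncycles <= maxcycles and re-check it wherever the variable may have changed.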

The source-to-source transformations implemented in ROSE cover an important subset of the directives
and constructs defined in the FAIL-SAFE language and were validated early using the USC test bed,
albeit in a semi-manual fashion. The automation of these directives is complete, but there are still
limitations on the data types and compiler analyses supported in the current software distribution. The
software has been tested and delivered, at the request of the Government, on a computer cluster
located at the University of Maryland, under the supervision of Dr. Simon Tyler.

Accomplishments:  SHINE  Component  of  the  FAIL-SAFE  IFR 

The FAIL-SAFE IFR monitors the run-time execution of an application and the system it
runs on, and manages their responses to errors. Whenever an error occurs, its location, type, and
severity are communicated to the IFR, which analyzes errors and subsystem failures and
provides feedback to the application, the operating system, and ultimately the developers and users
of the application. The role of the IFR can be understood as encapsulating functionality related to
the  monitoring  and  analysis  of  special  events  that  occur  during  the  execution  of  the  application,  and 
reasoning  about  related  problems  and  recovery  strategies.  For  example,  the  IFR  may  react  to  failing 
correctness  assertions,  monitor  the  cache  misses  and  performance  characteristics  of  program 
sections,  and  reason  about  the  frequency  of  errors  over  time  and  their  correlation.  In  case  of  severe 
errors  it  may  negotiate  with  the  operating  system  about  an  appropriate  recovery  strategy. 

At the heart of the IFR are the Spacecraft Health Inference Engine (SHINE) and an IFR knowledge base.
In principle, this capability can also be applied to areas beyond fault tolerance, such as performance
tuning, energy and power management, behavior analysis, and intrusion detection. SHINE was
selected because it was specifically designed to monitor a highly instrumented system and react to
conditions in real time. SHINE is capable of executing more than 100,000,000 inferences per second,
whereas the second-best inference engine at the time (CLIPS) executed only 40,000 rules per second.
CLIPS was therefore nowhere near fast enough to actively monitor applications as they run and to
react to their behaviors. JPL has extensive experience using
SHINE, making it a uniquely valuable team member for the FAIL-SAFE project.

SHINE  is  intended  for  those  areas  of  inference  where  speed,  portability,  and  reuse  are  of  critical 
importance.  Such  areas  have  historically  included  spacecraft  monitoring,  control  and  health, 
telecommunication analysis, medical analysis, financial and stock market analysis, fraud detection
(e.g., banking and credit cards), robotics, or basically any area where rapid and immediate response
to high-speed and rapidly changing data is required.

SHINE  was  originally  designed  to  be  embedded  in  single-threaded  applications  where  the 
interruption  from  the  inference  cycle  would  only  occur  at  task  rescheduling  points,  where  the 
complete  system  state  was  saved  by  the  operating  system  and  only  a  single  SHINE  was  executing  at 
any  given  point.  It  was  not  intended  to  be  an  all-encompassing  expert  system  for  mutually 
cooperating mini SHINEs all running simultaneously. Because of this assumption, all the rules and
states of a knowledge base are reduced to a data flow graph with all possible dependency paths
calculated in advance. However, the FAIL-SAFE IFR required a far more robust control structure
in which many SHINEs execute in true parallelism and share results among themselves. This
invalidated the state predictions made by the compiler and caused incorrect inferences to be made.

SHINE  introduces  a  novel  paradigm  for  knowledge  visualization  and  ultra-fast  inference  that  goes 
well  beyond  traditional  forward  and  backward  chaining  methodology.  A  sophisticated 
mathematical transformation based on graph-theoretic Data Flow analysis is introduced that
reduces the complexity of conflict resolution during the match cycle from O(n^2) to O(n) for many
kinds of inference operations. This transformation executes compiled SHINE knowledge bases at
rates in excess of 33,000,000 rules per second on flight hardware and over 220,000,000 rules per second
on a standard 3 GHz desktop PC.

A Data Flow program consists of (1) data, which drive the program; (2) operations, which are
activated when data are sent to them; and (3) results, which are what the program
returns when completed. When data reaches a procedure it activates or "fires off" that procedure.
Originally SHINE only contained what is called static firing, which meant a fairly broad set of graph
optimizations could be applied to globally optimize the flow graph. To support the FAIL-SAFE IFR
domain, we fundamentally changed the SHINE compiler to include two different ways to perform
this "firing": (1) Static Data Flow: A procedure will begin when a piece of data is located at every
input edge and no data is present at any output edge. Only one piece of data can reside on each edge;
and (2) Dynamic Data Flow: Each piece of data has some way of identifying which other data it belongs
with, such as a color, and when all the input edges contain data of the same type, the procedure will
begin.  Any  input  or  output  edge  can  handle  multiple  pieces  of  data. 
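
The static firing rule can be written down directly. The sketch below is a generic illustration of static data flow in C, not SHINE code: the types edge_t and node_t, the MAX_EDGES bound, and the functions can_fire, fire, and op_add are hypothetical names chosen for this example.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_EDGES 4  /* illustrative bound on edges per node (assumption) */

/* Static data flow: an edge holds at most one piece of data. */
typedef struct {
    bool full;
    int  value;
} edge_t;

typedef struct {
    edge_t *in[MAX_EDGES];
    size_t  n_in;
    edge_t *out[MAX_EDGES];
    size_t  n_out;
    int   (*op)(const int *args, size_t n);
} node_t;

/* Firing rule: data on every input edge and no data on any output edge. */
bool can_fire(const node_t *nd)
{
    for (size_t i = 0; i < nd->n_in; i++)
        if (!nd->in[i]->full) return false;
    for (size_t i = 0; i < nd->n_out; i++)
        if (nd->out[i]->full) return false;
    return true;
}

/* Consume the inputs, apply the operation, and place the result on outputs. */
bool fire(node_t *nd)
{
    int args[MAX_EDGES];

    if (!can_fire(nd)) return false;
    for (size_t i = 0; i < nd->n_in; i++) {
        args[i] = nd->in[i]->value;
        nd->in[i]->full = false;
    }
    int result = nd->op(args, nd->n_in);
    for (size_t i = 0; i < nd->n_out; i++) {
        nd->out[i]->value = result;
        nd->out[i]->full = true;
    }
    return true;
}

/* Example operation: sum of all inputs. */
int op_add(const int *args, size_t n)
{
    int s = 0;
    for (size_t i = 0; i < n; i++) s += args[i];
    return s;
}
```

Dynamic data flow would replace the single full flag with a queue of tagged values per edge and fire when every input queue holds a value with a matching tag. That extra matching step is the cost the additional analysis phase described below tries to confine to the rules that need it.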

A high-priority goal for FAIL-SAFE was to avoid introducing an overall performance loss to SHINE
by including dynamic firing, so we introduced an additional analysis phase that isolates only those
portions of a rule set that need dynamic firing and leaves the rest unchanged. Thus the dynamic
firing portions are compiled with a thread-safe set of optimization rules that guarantee valid state
transitions between co-routine activated instances of SHINE.

Overall, we found Data Flow to be a very powerful method of parallel programming or representing
knowledge that is executed in pseudo parallelism; however, it is very difficult to write programs in a
Data Flow environment. Un-optimized Data Flow programs require a lot of storage, and during
execution there are scheduling problems, which must be controlled by some means of hardware or
a software executive. As a result, very few machines have been developed for Data Flow, and the
chances are that very few will be developed for commercial use in the future since Data Flow is
expensive. However, Data Flow is an excellent means of representing information that can be
executed in pseudo parallelism, optimized, and then mapped to a traditional architecture for
execution. Obviously, many of the speed advantages of a true hardware Data Flow computer cannot
be realized when executed on a serial architecture, but it still offers an excellent framework to
represent parallel algorithms and a more efficient means to execute them than traditional methods.

Collaborations  and  Leveraged  Funding 

We have shared our draft assertion language with colleagues including Prof. Vivek Sarkar (Rice
University), Dr. David Bernholdt (Oak Ridge National Laboratory), Dr. Erik DeBenedictis (Sandia National
Laboratories), and Marti Bancroft (DOD contractor). Saurabh Hukerikar also spent two summers as
an intern at Sandia National Laboratories, and collaborated with Robert Clay.

At the direction of the Government, we have also explored with Dr. DeBenedictis and members of USC's
research staff the trade-off between sub-threshold voltage computing, in the context of computer
architecture and digital circuits, and resilience, with the aim of increased power efficiency. This
interaction led to various internal discussions of potential future actionable items in this area of research.

While this research has focused exclusively on software resilience, the USC component of this work
builds on earlier funding from the Semiconductor Research Corporation (SRC), NSF, DARPA MTO
(via SRC), and DOE ASCR's SciDAC-3 Institute for Sustained Performance, Energy, and Resilience
(SUPER). Both ROSE and SHINE have long histories of support and use from DOE and NASA,
respectively. 

Conclusions 

Prior work, funded by SRC, revealed that there are applications that tolerate faults in large portions
of their state space and still run on to correct results. Our goal has been to enable users to share this
information with their computing systems, and hence minimize unnecessary disruptions caused by
soft errors, which are expected to be increasingly common in the future. We have made significant
progress towards that goal. For example, we demonstrated this on the conjugate gradient
algorithm, as directed by the government. Of course, as with any early research project, we also
believe there is much more work to be done before this technology can be integrated into the
software widely used by Defense computational scientists.

Technology  Transfer 

We  have  distributed  drafts  of  our  assertion  language  to  two  DOD  ACS  program  investigators,  Prof. 
Vivek  Sarkar  at  Rice  University  and  Dr.  David  Bernholdt  at  Oak  Ridge  National  Laboratory.  Per  the 
direction of John Daly, we also shared it with Erik DeBenedictis of Sandia National Laboratories and
Marti Bancroft, a contractor for another DOD organization. We demonstrated resilient execution of
the HPCS Random Access benchmark at the DOD ACS workshop, July 17, 2014. We have also
published in the scientific literature, as enumerated below. Finally, at the direction of the
Government, we have installed at NSA R3, on a computer cluster, an
up-to-date release of the ROSE Compiler Infrastructure and an implementation of a selected set of
the ROSE-based source-to-source transformations that support the resilience directives developed
as part of the FAIL-SAFE language.

Future Plans


Our longer-term goal is to broaden the space of faults we can handle without halting an application
or the computing system it is running on. We would like to fully integrate and validate in a
production environment the research artifacts developed here, namely, the integration of the ROSE
compiler transformations with user-provided codes in collaboration with the IFR system, and
release them as open-source software for DOD and the broader community. We are searching for
additional research support to continue these efforts.

Preliminary Note


This document contains the specification of Version 8 of the assertion language, representing a major
revision of Version 7.0 distributed on June 20th, 2014. It includes new material on error control, introduces
additional  constructs  and  streamlines  some  of  the  previously  introduced  concepts,  resulting  in  some  changes 
of  terminology. 

The  document  continues  to  use  an  ad-hoc  pseudo  syntax  whose  main  purpose  is  to  serve  as  a  framework 
for  explaining  the  semantics  of  the  assertion  language.  Any  mapping  of  that  syntax  to  a  host  language  that 
preserves  these  semantics  is  considered  valid. 

1  Introduction 

1.1  Fault  Tolerance 

Fault tolerance is one important aspect of a system’s dependability, a property that has been defined by
the IFIP 10.4 Working Group on Dependable Computing and Fault Tolerance as the “trustworthiness of a
computing system which allows reliance to be justifiably placed on the service it delivers”.

A threat is any fact or event that negatively affects the dependability of a system. Threats can be
classified as faults, errors, or failures; their relationship is illustrated by the fault-error-failure chain [1].

A fault is a defect in a system. Faults can be dormant (e.g., incorrect program code that is not
executed) and have no effect. When activated during system operation, a fault leads to an error, which
is an illegal system state. A fault inside a component is called internal; an external fault is caused by a
failure propagated from another component, or from outside the system. Errors may be propagated through
a system, generating other errors. For example, a faulty assignment to a variable may result in an error
characterized by an illegal value for that variable; the use of the variable for the control of a for-loop can lead
to ill-defined iterations and other errors, such as illegal accesses to data sets and buffer overflows. A failure
occurs if an error reaches the service interface of a system, resulting in system behavior that is inconsistent
with its specification.

In  systems  with  an  exascale  computing  capability  we  must  assume  that  there  exist  faults  and  that  errors 
will  occur.  The  goal  of  the  FailSafe  system  is  to  achieve  fault  tolerance  by  ensuring  that  despite  the  potential 
existence  of  errors  the  system  will  never  enter  a  failure  state. 

1.2  Hard  Correctness  versus  Soft  Correctness 

The traditional approach to program correctness has been based on the requirement that the program
executes perfectly on a cycle-by-cycle basis, with numerical results mathematically precise except for possible
rounding errors. We call this hard correctness. For algorithms such as sorting, which need to deliver a
well-defined unique result, this represents the only valid approach to deal with correctness. However, for
many algorithms this strict requirement can be relaxed by replacing it with soft correctness, a fidelity
concept that defines the correctness of a result based on a user-perceived quality of the solution that tolerates
some errors as long as they satisfy an associated validity criterion.

Early ideas in this direction originated from AI in the context of “soft computing” [12]; however, the
principal idea of qualitatively defining correctness at the algorithmic level applies to other applications
as well, with examples including image processing, pattern-matching algorithms in bioinformatics, and
iterative numerical algorithms yielding approximate results [7].

The adoption of soft correctness has immediate consequences for fault tolerance. Essentially, it leads to
the notion of the degree of correctness of an algorithm, which is determined by the amount of tolerable error.
For example, some errors may be completely ignored (such as a few erroneous pixels in a large image file),
some errors occurring in a hierarchical control structure may be corrected at a higher level of abstraction,
and even certain hardware component failures may be tolerated at the application level.

A major motivation for the distinction between hard and soft correctness is the significantly higher
overhead in terms of execution time and energy consumption for enforcing hard correctness. As a consequence,
it is often useful to partition a program into high-reliability and low-reliability sections. In high-reliability
sections, program semantics demand a strictly enforced hard correctness approach. Reliability in such
sections can be supported via software mechanisms such as correctness assertions, introspection, and redundant
execution. In contrast, low-reliability sections can tolerate certain classes of errors, thus significantly
reducing overhead. Errors occurring in such sections may be ignored or corrected at a higher level of abstraction,
depending on algorithm-specific validity criteria. “Sandboxing” [3] is a method useful for certain numerical
applications that supports such a distinction. It was originally used in computer security applications, where
an untrusted “guest code” is executed in an isolated state space (the “sandbox”), which protects the rest of
the application (the “host”) from errors occurring during execution of the guest code. A recent case study
illustrates this approach with FT-GMRES, a fault-tolerant Generalized Minimal Residual algorithm for
solving non-symmetric linear systems: the guest represents an unreliable inner solver for a linear system Ax = y,
with a given termination criterion. The outer solver checks the solution vector for errors and determines
if the result converges within the bounds specified by the algorithm; minor numerical perturbations are
ignored. If the solution contains errors or diverges, it is discarded and replaced with a valid value determined
by information in the state space of the host.
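
The host/guest split described above can be sketched in a few lines of Python. This is only an illustration of the sandbox pattern, not FT-GMRES itself: the inner solver is a plain Jacobi sweep, and the names `unreliable_inner_solve`, `sandboxed_solve`, and the `faulty_steps` fault-injection parameter are hypothetical, introduced for this example. The host accepts a guest result only if it is finite and reduces the residual; otherwise the result is discarded and the host keeps its own trusted iterate.

```python
import math

def residual_norm(A, x, y):
    """Euclidean norm of y - A x for a small dense system."""
    n = len(y)
    return math.sqrt(sum((y[i] - sum(A[i][j] * x[j] for j in range(n))) ** 2
                         for i in range(n)))

def unreliable_inner_solve(A, x, y, inject_fault=False):
    """Guest code: one Jacobi sweep, optionally corrupted to mimic a soft error."""
    n = len(y)
    new_x = [(y[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
             for i in range(n)]
    if inject_fault:
        new_x[0] = float("nan")  # simulated fault inside the sandbox
    return new_x

def sandboxed_solve(A, y, tol=1e-10, max_outer=200, faulty_steps=()):
    """Host code: accept a guest result only if it passes the validity check."""
    x = [0.0] * len(y)               # trusted state owned by the host
    best = residual_norm(A, x, y)
    rejected = 0
    for step in range(max_outer):
        if best <= tol:
            break
        candidate = unreliable_inner_solve(A, x, y,
                                           inject_fault=step in faulty_steps)
        finite = all(math.isfinite(v) for v in candidate)
        r = residual_norm(A, candidate, y) if finite else math.inf
        if finite and r < best:
            x, best = candidate, r   # guest result accepted
        else:
            rejected += 1            # discarded; host keeps its own value
    return x, best, rejected
```

The acceptance test (finite values and a strictly smaller residual) is one possible validity criterion chosen for the sketch; a real outer solver would use the bounds specified by its algorithm.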

1.3  Overview  of  the  Assertion  Language 

The  assertion  language  (AL)  provides  a  set  of  constructs  that  can  be  embedded  in  a  host  language  program 
for  the  purpose  of  checking  the  state  of  computations  during  runtime,  expressing  tolerance  for  certain  kinds 
of  program  errors,  and  specifying  relationships  that  can  be  used  for  dynamic  program  optimization. 

In  contrast  to  approaches  such  as  the  Java  Assertion  Language  [10],  AL  is  defined  independently  of  a 
particular  host  language.  However,  given  a  specific  host  language,  the  details  of  AL  semantics  will  to  a 
certain  extent  rely  on  host  language  features  such  as  expressions,  naming,  and  scoping  conventions. 

AL  constructs  fall  into  four  major  categories:  (1)  assertions,  (2)  tolerance  directives,  (3)  pragmas,  and 
(4)  error  control  features. 

Assertions provide a means to express the knowledge that a propositional logic predicate over data
should be satisfied at certain points during program execution. Predicates are classified as either status
predicates or invariants. Status predicates define a precondition or postcondition for individual program
statements, whereas invariants must be satisfied throughout extended regions.

A major purpose of assertions is their support for the hard correctness of an algorithm. For example,
assertions used in the context of a sandbox may be used to determine if the potentially erroneous results
of computation in a low-reliability region can be used at a higher level of abstraction. Assertions also
provide some limited support for a Design by Contract capability [8,9]. Specifically, status predicates allow
reasoning about the partial correctness of program components via Hoare logic [4]: whenever a statement
is executed in a state that satisfies its precondition then, if the execution of the statement terminates, the
postcondition must hold immediately after its termination.
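
As a rough illustration of how status predicates bracket a statement, the following Python sketch checks a precondition immediately before a call and a postcondition immediately after it. The decorator `contract`, the exception `PredicateViolation`, and the function `integer_sqrt` are hypothetical names chosen for this example; they are not part of AL, whose mapping to a host language is left open by this document.

```python
import functools

class PredicateViolation(Exception):
    """Raised when a pre- or postcondition evaluates to false."""

def contract(pre=None, post=None):
    """Attach an optional precondition and postcondition to a function."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args):
            # Pre-execution point: the precondition must hold here.
            if pre is not None and not pre(*args):
                raise PredicateViolation(f"precondition of {fn.__name__} violated")
            result = fn(*args)
            # Post-execution point: reached only if the call terminated.
            if post is not None and not post(result, *args):
                raise PredicateViolation(f"postcondition of {fn.__name__} violated")
            return result
        return wrapper
    return decorate

@contract(pre=lambda y: y >= 0,
          post=lambda r, y: r * r <= y < (r + 1) * (r + 1))
def integer_sqrt(y):
    """Largest integer r with r*r <= y, found by linear search."""
    r = 0
    while (r + 1) * (r + 1) <= y:
        r += 1
    return r
```

Because the postcondition is evaluated only after the call returns, the check corresponds to partial correctness: nothing is claimed about executions that do not terminate.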

Tolerance directives instruct the system to ignore occurrences of certain classes of errors during
program execution.

Pragmas  can  provide  information  about  key  relationships  between  energy  consumption,  performance, 
and  reliability  that  can  guide  the  system  in  optimizing  an  objective  function  at  runtime,  depending  on  the 
environment  in  which  a  program  executes. 

Finally,  error  control  features  provide  tools  for  error  recovery  via  the  association  of  error-handling 
routines  with  AL  constructs  and  source  program  components. 

1.4  The  FailSafe  System 

Exascale architectures will be characterized by heterogeneous processing and memory structures, millions
of components, and deep memory hierarchies. Applications executing in such systems will need to balance
the quest for performance optimization and minimization of energy consumption with a flexible approach to
resilience that combines the enforcement of hard correctness for critical code components with an
acceptable degree of fault tolerance for less critical code sections. The salient goal of the assertion language is
to provide information that supports the FailSafe system, as outlined in Figure 1, in the implementation of
strategies for the optimization of objective functions in this complex space.

A major component of the FailSafe system is the Introspection Framework for Resilience (IFR) [6]. The
IFR encapsulates functionality related to the monitoring and analysis of errors and other special events that
occur during program execution and reasons about related optimization and recovery strategies. At its heart
is a powerful inference engine. The IFR generates information for the Knowledge/Experience Database
(KEX), which in an advanced version will include knowledge about the application domain, properties of
the underlying hardware and software system, strategic optimization goals, and behavioral data extracted
from actual executions. The information in the KEX evolves dynamically as an application is subject to
iterative optimization steps during multiple compile/execute cycles; its input to the user, or to a future
automated dynamic model evaluator, provides the opportunity to adapt the annotation of the source program for
a new compile/execute iteration by taking into account the results of previous iterations and their analysis.

One of the implicit assumptions when discussing assertion languages is often that the management of
their constructs is exclusively in the hands of the user. However, the above outline of IFR functionality in the
context of KEX, together with powerful methods for compilation and runtime analysis, suggests that the
compiler as well as the IFR and the runtime system may, in a future system, play an important role in the
generation, management, and execution control of assertion language constructs, either autonomously or in
cooperation with the user.

2  Regions 

2.1  Syntax 

1. region → program-region | data-region

2. program-region → single-point-region | composite-region

3. single-point-region → statement-label | function-identifier

4. composite-region → function-region | assertion-region

5. assertion-region → statement-label “region” [ error-handler ] “{” statement-sequence “}”

6. data-region → declaration-label | variable-identifier | memory-region-identifier

7. aggregate-region → aggregate-single-point-region | aggregate-data-region | aggregate-program-region

8. aggregate-single-point-region → “in (” single-point-region-list “)”

9. aggregate-data-region → “in (” data-region-list “)” [ region-constraint ]

10. region-constraint → modifier “(” composite-region-list “)” | other-constraint

11. modifier → “excluding” | “restricted to”

12. aggregate-program-region → “in (” composite-region-list “)” [ region-constraint ]

2.2  Regions  and  Their  Association  With  Assertion  Language  Constructs 

Each  defining  occurrence  of  an  AL  construct  associates  it  with  one  or  more  regions,  which  we  call  the  target 
regions  of  the  construct.  Regions  exist  in  space  and  time:  they  are  connected  to  certain  components  of  the 
source  program,  such  as  variable  or  function  declarations,  and  their  period  of  existence  is  determined  by  the 
way in which these components are used during program execution. We use the term lifetime to refer to this
period, both for regions and for their associated AL constructs.

A region is either a program region or a data region. Program regions can be single point or composite
regions. A single point region is defined either by a statement label or by a function identifier. In the first
case, it identifies the statement determined by that label; in the second case, it represents all statements that
invoke the function. If S denotes such a statement, then it is associated with two single points of execution:
its pre-execution point denotes the point immediately before an execution of S, and its post-execution point
the point immediately following an execution of S.

In contrast to single point regions, composite regions are defined by contiguous sets of statements. They
can be function regions or assertion regions. The lifetime of a function region is determined by the period
of time in which the program executes any invocation of that function. An assertion region is a labeled
contiguous block of source code statements marked by the region keyword and enclosed between “{” and
“}”. Assertion regions must satisfy the single-entry condition. Their lifetime is any period of time in which
program execution resides in that region.

A data region is either a declaration region, identified by the name of a source language declaration¹,
or a variable region, identified by the name of a single variable. If $v_1, \ldots, v_n$ are the variables declared in a
declaration region, then it represents the set of all variable regions for the $v_i$, $1 \le i \le n$.

During program execution, a variable region will result in the creation of a set of data objects.
Depending on the declaration of the variable, such objects may be of a simple type, pointers, or aggregates such as
arrays, structures, or class instances. The lifetime of a variable region is the period of time during which
any associated objects exist during program execution. The lifetime of a declaration region is the union of
the lifetimes of the associated variable regions. For tolerance directives (see Section 5), specially managed
error-tolerant memory regions are also considered as data regions.


¹ Here we assume that the language provides syntax for labeling a declaration.

The association of an AL construct with its target regions can be established in two ways. First, by
context, i.e., by placing the construct in the source program text adjacent to a program object, which then
becomes its single target². Second, as a syntactic convenience, the defining occurrence of an AL construct
can be specified separately from its target regions in the source program by providing an explicit link to its
associated aggregate-region. In general, such a link is of the form:


in ($R_1, \ldots, R_n$) [$C$]


where $n \ge 1$, $R_1, \ldots, R_n$ denote regions, and $C$ is an optional region constraint. All regions $R_1, \ldots, R_n$ must
be of the same type, i.e., single-point regions, composite regions, or data regions. A region constraint can be
used to exclude some implicit components, such as nested functions or assertion regions, from the aggregate
region. For example:

float A, B, C;

atype function f(...) {...};

assert (p(allvars)) in (A, C) excluding (f);

Here, the invariant predicate p is asserted to hold for the assertion’s target variables A and C throughout
their lifetimes, except when program execution resides in an invocation of function f (allvars refers to
every variable with which the assertion is associated).
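
The effect of the excluding constraint in this example can be mimicked in a host language roughly as follows. This Python sketch is an assumption-laden illustration, not a defined mapping of AL: the class `DataInvariant`, its `excluded` context manager, and the choice to apply the predicate to each target variable separately are all hypothetical.

```python
from contextlib import contextmanager

class DataInvariant:
    """Invariant over target variables that is suspended inside excluded regions."""

    def __init__(self, predicate, targets):
        self.predicate = predicate   # applied to each target variable's value
        self.targets = targets       # names of the target variables, e.g. A and C
        self._excluded_depth = 0     # > 0 while execution is inside an excluded region
        self.violations = 0

    @contextmanager
    def excluded(self):
        """Wrap an invocation of an excluded function, such as f."""
        self._excluded_depth += 1
        try:
            yield
        finally:
            self._excluded_depth -= 1

    def check(self, variables):
        """Evaluate the invariant at the current point; True means satisfied."""
        if self._excluded_depth > 0:
            return True              # the invariant is not asserted here
        ok = all(self.predicate(variables[name]) for name in self.targets)
        if not ok:
            self.violations += 1
        return ok
```

A depth counter rather than a boolean flag is used so that nested or recursive invocations of the excluded function keep the invariant suspended until the outermost invocation returns.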

The  detailed  rules  governing  the  association  of  AL  constructs  with  their  targets  will  be  discussed  in  the 
relevant  sections. 


3  Error  Control 

3.1  Syntax 

1. error-control → error-handler | redundancy-management-directive

2. error-handler → basic-error-recovery | standard-error-recovery | composite-region-error-recovery

3. basic-error-recovery → “error (” basic-error-handler “)”

4. basic-error-handler → function-invocation

5. standard-error-recovery → basic-error-recovery | “error (” class-specific-recovery+ [ else-clause ] “)”

6. class-specific-recovery → error-class “→” basic-error-handler “;”

7. error-class → numerical_error | NaN | Inf | floating-point-representation-error |
integer-representation-error | SECDED | crash | predicate_violation | tolerance_violation |
other-predefined-error-class | user-defined-error-class | ANY

8. else-clause → “else” basic-error-handler

² The detailed rules for contextual association are implementation dependent.

9. composite-region-error-recovery → standard-error-recovery repetition-directive [ save-directive ]

10. repetition-directive → “repeat (” rep-count [ “,” restore-directive ] [ “,” final-error-recovery ] “)”

11. rep-count → integer-expression

12. restore-directive → “restore (” data-region-list “)”

13. final-error-recovery → “finally (” basic-error-handler “)”

14. save-directive → “save (” data-region-list “)”

15. redundancy-management-directive → redundancy-degree decision-criterion [ standard-error-recovery ]

16. redundancy-degree → “double” | “triple”

17. decision-criterion → “(” variable-list “)”

3.2  Overview 

Each error occurring during the execution of a program belongs to a specific error class. For system-defined
error classes, such as numerical_error, default error handlers are provided. Future versions of AL are
expected to provide a capability for user-defined error classes and their management.

Error control in AL includes two sets of features. The first set, subsumed under the term error handler,
allows the specification of recovery mechanisms for errors occurring in the associated AL or source program
construct. Three levels of increasing complexity are distinguished. Basic error handlers deal with an error
by calling a function that executes a recovery routine. This is the only available option for status predicates.
Standard error recovery includes basic error handlers plus a capability for selecting an error response
depending on the error class; it can be used in the context of data regions and composite regions. Finally,
composite region error recovery supports additional features specially oriented towards composite regions,
including repeated execution of the region and control for the saving and restoring of data structures.

The second set of error control features of AL is focused on preventing and/or correcting errors occurring
in composite regions. It provides redundancy management directives that execute a composite region two
or three times and evaluate the correctness of results based upon a decision criterion.

3.3  Error  Classes 

Below we characterize briefly some of the system-defined error classes.

1. numerical_error: This is the class of all kinds of numerical errors, including the four classes described
below.

2. NaN: In the IEEE-754 standard, NaN (Not a Number) represents a non-numerical value occurring in
a context that requires a number.

3. Inf: In the IEEE-754 standard, Inf (Infinity) represents the occurrence of an overflow in a numerical
computation.

4. Floating-Point Representation Error

5. Integer Representation Error

6. SECDED: Single-error correction, double-error detection.

7. crash: Processor crash.

8. predicate_violation: Evaluation of a status predicate or invariant during its lifetime yields false (see
Section 4).

9. tolerance_violation: Evaluation of a tolerance clause during its lifetime yields false (see Section 5).

10. ANY: An error belonging to any class.
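
As a small illustration of the numerical classes, the following Python sketch maps a floating-point value to the NaN or Inf class and records that numerical_error subsumes the four classes listed after it. The function names `classify_float` and `is_numerical_error` and the string labels are assumptions made for this example; AL does not prescribe how a runtime detects these conditions.

```python
import math

# The four classes that the text lists as belonging to numerical_error.
NUMERICAL_SUBCLASSES = {
    "NaN",
    "Inf",
    "floating-point-representation-error",
    "integer-representation-error",
}

def classify_float(value):
    """Return "NaN" or "Inf" for IEEE-754 special values, else None."""
    if math.isnan(value):
        return "NaN"
    if math.isinf(value):
        return "Inf"
    return None

def is_numerical_error(error_class):
    """True if error_class is numerical_error or one of its subclasses."""
    return error_class == "numerical_error" or error_class in NUMERICAL_SUBCLASSES
```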

3.4 Error Handler

3.4.1 Basic Error Recovery

Basic  error  recovery  results  in  the  invocation  of  a  basic  error  handler,  which  is  a  system-  or  user-defined 
function.  This  function  call  may  contain  arguments  that  provide  a  more  detailed  characterization  of  the  error 
as  well  as  additional  information  about  program  components  that  may  have  contributed  to  the  error,  such  as 
key  variables  involved  in  the  erroneous  computation. 

Example 


x = f(y) assert (pre (y < U)) error (range_violation(y, U));

In this example, the predicate y < U is a precondition associated with the assignment statement x = f(y),
which needs to be satisfied immediately before an execution of the assignment. Its violation results in the
invocation of the basic error handler range_violation.

3.4.2  Standard  Error  Recovery 

Standard  error  recovery  can  be  used  to  address  errors  that  occur  during  the  execution  of  a  data  or  composite 
region.  In  addition  to  basic  error  recovery  it  provides  a  capability  for  specifying  recovery  depending  on  the 
error  class: 

error ($c_1 \rightarrow h_1$;
       $\ldots$
       $c_n \rightarrow h_n$;
       [else $h_{n+1}$])


where

• $n \ge 1$ is the number of error classes for which recovery is specified explicitly

• $c_i$, $1 \le i \le n$, denotes an error class

• $h_i$, $1 \le i \le n$, denotes a basic error handler associated with error class $c_i$

• $h_{n+1}$ denotes the optional basic error handler for any error with a class different from all of the
$c_i$, $1 \le i \le n$

Assume R to be the region with which this construct is associated, and that an error of class c has been
detected during execution in R. Then c is sequentially compared with $c_1, c_2, \ldots$. If $c_i$, $1 \le i \le n$, matches c,
then $h_i$ will be activated. If c is different from all $c_i$, $1 \le i \le n$, and an else clause is specified, then handler
$h_{n+1}$ will be activated. In the absence of an else clause a system-specific response will be initiated in this
case³.
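
The sequential matching just described can be sketched in Python as below. The function `dispatch_error` and the exception `UnhandledError` (standing in for the unspecified system-specific response) are hypothetical names. Two matching rules in the sketch are assumptions rather than statements of the specification: classes are compared by equality, and a clause for ANY matches every class.

```python
class UnhandledError(Exception):
    """Stand-in for the system-specific response when no clause applies."""

def dispatch_error(error_class, clauses, else_handler=None):
    """Activate the handler of the first clause whose class matches error_class.

    clauses is an ordered list of (error_class, handler) pairs, mirroring
    c_1 -> h_1; ...; c_n -> h_n. else_handler mirrors the optional else clause.
    """
    for candidate, handler in clauses:          # compare with c_1, c_2, ... in order
        if candidate == error_class or candidate == "ANY":
            return handler(error_class)
    if else_handler is not None:                # class differs from all c_i
        return else_handler(error_class)
    raise UnhandledError(error_class)           # no else clause given
```

For instance, with clauses for SECDED and numerical_error plus an else handler, an error of class crash falls through both clauses and reaches the else handler.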


3.4.3  Composite  Region  Error  Recovery 

A  composite  error  recovery  specification  is  of  the  form: 

error ($c_1 \rightarrow h_1$;
       $\ldots$
       $c_n \rightarrow h_n$;
       [else $h_{n+1}$])
repeat (maxrep [, restore ($v_{j_1}, \ldots, v_{j_p}$)] [, finally ($h$)])
[save ($v_1, \ldots, v_m$)]


where


• $n$, $c_i$, and $h_i$ have the same meaning as above

• maxrep is an integer expression with a value $\ge 1$ that specifies the maximum number of repetitions of R to be
performed in the case of error

• the $v_{j_1}, \ldots, v_{j_p}$ in the optional restore directive denote a subset of the variables $v_1, \ldots, v_m$ specified in the
associated save directive

• $h$ denotes a basic error handler

• $v_j$, $1 \le j \le m$, with $m \ge 1$, in the optional save directive specifies a program variable

This  construct  is  executed  as  follows: 

1. R is initially activated, after executing the save directive if present.

2. If the execution of R is error-free, normal program execution continues with the statement following
R.

3. If during the execution of R an error occurs, then a response as described above in Section 3.4.2 is
initiated. If this response is able to successfully process the error in the sense that an effect equivalent
to a correct execution of R is achieved, normal program execution is continued with the statement
following R.

Otherwise, a repeated execution of R is initiated, after restoring the values of variables specified in
the restore directive, if present. If a restore directive is not specified, but a save directive has been
executed at the beginning, all variables specified in the save directive will be restored at that point.

³ The semantics of this construct is slightly different if used in the context of tolerance directives, see Section 5.

4. Step 3 is repeated until R has been repeated at most maxrep times. If, at the maxrep-th repetition, no
error occurs, normal program execution continues. Otherwise, if present, the handler, h, specified in
the final error recovery is activated. If no final error recovery is specified, a system-dependent error
handler will be invoked at this point.

If a save directive is part of the construct, then the variables $v_j$, $1 \le j \le m$, will be saved by the system
immediately before the execution of R is initiated⁴.

Example 


R1: region R1 error (SECDED → secded_handler(...);
                     numerical_error → numeric_handler(...);
                     else other_handler(...))
              repeat (2, finally (final_handler(...)));

    {S1; S2; ...}

If an error occurs during the initial execution of region R1, the following actions are performed:

• If the error is of class SECDED or numerical_error, then secded_handler or numeric_handler will be
respectively activated. If an error of a different class occurs, then other_handler will be activated. If
the error handler can repair the problem, then normal execution continues with the statement following
R1. Otherwise, the execution of R1 is repeated.

• After the first repeated execution of R1, the same actions as in the first step above are taken.

• If a second repeated execution of R1 is necessary, and no error occurs during that execution, normal
program execution continues. If an error occurs during that execution, the basic error handler
final_handler(...) is activated.
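
The control flow of composite region error recovery, as walked through in this example, can be sketched in Python. All names here (`RegionError`, `run_region`, and the `handle_error` callback that returns True when it has repaired the error) are hypothetical. The sketch follows the reading suggested by the example: the region runs once initially and is then repeated up to maxrep times, and an error in the last repetition goes straight to the final handler.

```python
import copy

class RegionError(Exception):
    """Signals an error of a given class inside a composite region."""
    def __init__(self, error_class):
        super().__init__(error_class)
        self.error_class = error_class

def run_region(region, state, handle_error, maxrep,
               save=(), restore=None, final_handler=None):
    """Execute region(state) with save/restore, repetition, and final recovery."""
    # Save directive: copy the named variables before R is first activated.
    saved = {name: copy.deepcopy(state[name]) for name in save}
    # Without a restore directive, all saved variables are restored.
    to_restore = save if restore is None else restore
    for attempt in range(maxrep + 1):        # initial run plus up to maxrep repetitions
        if attempt > 0:
            for name in to_restore:
                state[name] = copy.deepcopy(saved[name])
        try:
            region(state)
            return "completed"               # error-free execution
        except RegionError as err:
            if attempt == maxrep:
                break                        # last repetition failed
            if handle_error(err.error_class, state):
                return "repaired"            # handler achieved an equivalent effect
    if final_handler is not None:
        final_handler(state)
        return "final_handler"
    raise RuntimeError("no final error recovery specified (system-dependent response)")
```

Deep copies are used for the saved values so that later modifications inside the region cannot alter the snapshot; as the footnote on the save directive notes, this copying can be costly for large data structures.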

3.5  Redundancy  Management  Directives 

Error control features in AL include a preliminary version of redundancy management directives (RMDs),
which control the redundant execution of a composite region, R, subject to two constraints⁵: (1) R satisfies
the single-entry single-exit condition, and (2) the execution of R does not result in any side effects for
non-local variables. A region satisfying these constraints is called an RMD region; it can be associated with an
optional standard error recovery construct.

A key component of an RMD is the decision criterion: it specifies a nonempty list of decision variables
whose values at the end of redundant executions determine the decision on the correctness of the region’s
execution.

Two kinds of RMDs are distinguished: double redundancy directives and triple redundancy directives.
Let R denote a composite region to which one of the directives is applied. Then the effect of RMDs can be
described as follows:

⁴ The execution of the save directive may result in considerable overhead in terms of performance, memory use, and energy
consumption, in particular if large data structures need to be copied. This problem may be alleviated by providing more flexibility
for the choice of data to be saved and/or restored via additional control features.

⁵ Loosening these restrictions may be possible in certain contexts. This applies specifically to the single-exit constraint, which
is not required for extended basic blocks.

1. Double Redundancy Directive: R will be executed twice, applied to the same input and without any
intermediate computations affecting the results of these executions.

If at least one of these executions encounters an error before terminating, then the execution of R is
considered erroneous, and a recovery action is taken upon the first detection of such an error.

If both executions of R terminate without error, and the values of all decision variables after
termination are identical for both executions, then the execution of R is considered error-free. Otherwise, the
execution of R is considered erroneous.

2. Triple Redundancy Directive: R will be executed three times, applied to the same input and without any
intermediate computations affecting the results of these executions.

If at least two executions of R encounter an error before terminating, then the execution of R is
considered erroneous, and a recovery action is taken upon the first detection of such an error.

If at least two executions terminate without error, and the values of all decision variables after
termination are identical for two of these executions, then the execution of R is considered error-free. Otherwise,
the execution of R is considered erroneous.
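
The two decision rules can be sketched in Python as follows. The functions `run_double` and `run_triple` are hypothetical, and the sketch makes simplifying assumptions: a region is modeled as a function that takes its input and returns a dictionary of results, any exception counts as an error of the execution, and recovery actions are left to the caller.

```python
import copy

def _execute(region, inputs):
    """Run region on a private copy of inputs; None signals an erroneous execution."""
    try:
        return region(copy.deepcopy(inputs))
    except Exception:
        return None

def _agree(a, b, decision_vars):
    """True if both executions produced identical values for all decision variables."""
    return all(a[v] == b[v] for v in decision_vars)

def run_double(region, inputs, decision_vars):
    """Double redundancy: error-free only if both runs succeed and agree."""
    first = _execute(region, inputs)
    second = _execute(region, inputs)
    if first is None or second is None:
        return False, None
    if _agree(first, second, decision_vars):
        return True, first
    return False, None

def run_triple(region, inputs, decision_vars):
    """Triple redundancy: error-free if two error-free runs agree."""
    results = [_execute(region, inputs) for _ in range(3)]
    good = [r for r in results if r is not None]
    for i in range(len(good)):
        for j in range(i + 1, len(good)):
            if _agree(good[i], good[j], decision_vars):
                return True, good[i]
    return False, None
```

Each execution receives its own deep copy of the input, which reflects the requirement that the redundant executions see the same input and do not affect one another.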


4  Assertions 

4.1  Syntax 

1. assertion → assert-clause [ assertion-target ] [ error-handler ]

2. assert-clause → “assert (” predicate-list “)”

3. predicate → status-predicate | invariant

4. status-predicate → precondition | postcondition

5. precondition → “pre (” boolean-expression “)”

6. postcondition → “post (” boolean-expression “)”

7. invariant → boolean-expression

8. assertion-target → aggregate-region

4.2  Overview 

The specification of an assertion consists of up to three components: (1) a mandatory assert clause, (2) an
optional assertion target, and (3) an optional error handler.

An assert clause contains between one and three predicates, including at most one precondition, one
postcondition, and one invariant. If an error handler is provided for an assertion, it applies to all its
predicates.
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4.3  Assert  Clauses 

An assert clause consists of the keyword assert followed by a parenthesized list of predicates, which are
propositional logic expressions over data. Depending on how a predicate is used, we distinguish between
status predicates and invariants. A status predicate is associated with single-point regions; it can be either a
precondition or a postcondition and must be satisfied at the pre-execution or the post-execution
points of these regions, respectively (see Section 2). An invariant can be associated either with data regions
(in which case it is called a data invariant) or with composite regions (in which case it is called a program invariant).

The  core  component  of  a  predicate  is  a  boolean  expression  (which  in  some  cases  may  be  tweaked  due  to 
special  context  requirements).  We  call  this  expression  an  assertion  expression.  It  may  contain  pure  functions 
and  must  be  side-effect  free. 

Let E denote an assertion expression, and let a predicate application location (PAL) be any location
in the program where E can be legally applied during execution within the assertion’s lifetime. For all
identifiers referenced in the assertion (including identifiers in E, region identifiers, and any identifiers
referenced in an error handler), the links to their defining occurrences are established at the point of the
assertion’s defining occurrence. If loc is an arbitrary PAL, then all these links must be defined at loc, and the
value of E is computed using the values of its identifiers at this location. If that evaluation yields true, the
predicate is said to be satisfied at loc; otherwise it is said to be violated at loc, resulting in an error of class
predicate_violation. If any of the variable links in E is undefined at loc, or if a variable reference yields a
corrupted value, then the application of the predicate at this location is undefined and the expression cannot
be properly evaluated, resulting in an error. This class of error is a special case of a predicate_violation.
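
The evaluation rule above can be sketched in Python as follows. This is an illustrative model only: the names apply_predicate and PredicateViolation are hypothetical, the identifier links visible at a PAL are modeled as a dictionary, and an undefined link is modeled as a missing key. Detection of corrupted values is not modeled.

```python
class PredicateViolation(Exception):
    """Hypothetical stand-in for an error of class predicate_violation."""

def apply_predicate(expr, env):
    """Evaluate assertion expression `expr` at a PAL.

    `env` maps identifier names to the values visible at that location.
    """
    try:
        satisfied = expr(env)
    except KeyError as missing:
        # An identifier link is undefined at this location; the text treats
        # this as a special case of a predicate violation.
        raise PredicateViolation(f"undefined link: {missing}") from None
    if not satisfied:
        raise PredicateViolation("assertion expression evaluated to false")
    return True
```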

4.4  Rules  for  the  Association  Between  Predicates,  Regions,  and  Error  Handlers 

The  association  between  predicates,  associated  regions,  and  error  handlers  is  subject  to  the  following  rules: 

1. Status Predicates: An assertion specifying a status predicate can only be associated with an aggregate
single-point region and a basic error handler.

For  any  single-point  region,  at  most  one  precondition  and  at  most  one  postcondition  can  be  specified. 

2.  Data  Invariants:  An  assertion  specifying  a  data  invariant  can  only  be  associated  with  an  aggregate 
data  region.  It  can  be  combined  with  standard  error  recovery. 

Any  data  region  can  be  associated  with  at  most  one  data  invariant. 

3.  Program  Invariants:  An  assertion  specifying  a  program  invariant  can  only  be  associated  with  an 
aggregate  program  region.  It  can  be  combined  with  a  composite  region  error  recovery. 

Any  composite  region  can  be  associated  with  at  most  one  precondition,  at  most  one  postcondition, 
and  at  most  one  program  invariant. 

Multiple predicates for one assertion target can be specified in one or more assertions. Splitting a
multiple-predicate assertion into separate assertions allows the violation of different predicates to be recovered
by different error handlers.

4.5 Built-In Predicates for Error Detection


4.5.1 Check_Predicate

Consider an invariant, assert (E), associated with a variable, v. Unless otherwise specified, the system (i.e.,
the compiler and runtime system) is responsible for validating the invariant throughout the lifetime of the
assertion. Depending on the context in which the assertion is used, this may result in a significant runtime
overhead. AL provides a way to explicitly specify at which points during program execution the assertion
expression needs to be evaluated via a built-in function check_predicate (v), whose invocation results in the
evaluation of E. This function can be used in a status predicate as shown in the example below:

atype v assert (E) no_check;

S: v=g(x,y,z) assert (post (check_predicate (v))) error (h(...));

The above declaration defines an invariant assertion, assert (E), that is contextually associated with the
declaration of variable v. The keyword no_check attached to this assertion suppresses its automatic
validation by the system.

The status predicate associated with statement S uses the built-in function check_predicate (v) as a
postcondition. This function is invoked immediately after the completion of the assignment v = g(x, y, z);
it results in the evaluation of the boolean expression E at that point.

Note  that  since  AL  requires  at  most  one  invariant  to  be  associated  with  a  data  region,  the  reference  to  v 
in  the  status  predicate  determines  this  invariant  unambiguously. 

4.5.2 Check_Tolerance

A similar function is provided for the explicit validation of a tolerance expression associated with a variable,
v (see Section 5). It is invoked in the form

check_tolerance (v)

and yields true iff the tolerance expression for v is satisfied.

4.5.3 Check_Error

An invocation of the built-in predicate check_error (c), where c denotes an error class, results in a value of
true iff at the time of execution an error of class c has been detected by the system.
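
As a rough illustration of how no_check and check_predicate interact, the following Python sketch keeps one invariant per variable and evaluates it either automatically or only on request. The registry, the helper names declare and assign, and the simplification that automatic validation happens only at assignments are assumptions of this sketch, not part of AL.

```python
invariants = {}   # variable name -> (assertion expression, auto_check flag)
store = {}        # variable name -> current value

def declare(name, expr, no_check=False):
    """Attach the single invariant `expr` to variable `name`."""
    invariants[name] = (expr, not no_check)

def assign(name, value):
    """Assign a value; validate automatically unless no_check was given."""
    store[name] = value
    expr, auto = invariants.get(name, (None, False))
    if auto and not expr(value):
        raise ValueError(f"invariant violated for {name}")

def check_predicate(name):
    """Explicitly evaluate the invariant attached to `name`."""
    expr, _ = invariants[name]
    return expr(store[name])
```

Because each variable carries at most one invariant, naming the variable is enough to select the expression to evaluate.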

4.6  Examples 

Status  Predicate:  Example  1 


x=f0(y) assert (post (x < U)) error (h1(x,y,U));
L1: z=g(x);

This example shows an assertion associated by context with the single-point region defined by the assignment
statement x = f0(y). The assertion specifies a postcondition, i.e., the assertion expression x < U
needs to be evaluated immediately after the execution of the assignment x = f0(y), and before the statement
z = g(x) at label L1 is executed. If the assertion fails, the basic error handler h1(x,y,U)
is activated.

If L1 is the target of a control transfer, the above assertion may or may not be satisfied when control is
transferred to L1 from an outside location. In order to assure that the assertion is always satisfied before
execution of the statement labeled by L1, in whatever way this statement is reached, the assertion would
have to be formulated as a precondition for that statement, as shown below:

x=f0(y);

L1: z=g(x) assert (pre (x < U)) error (h1(x,y,U));


The above program could also be formulated by using an explicit association:

x=f0(y);

L1: z=g(x);

assert (pre (x < U)) error (h1(x,y,U)) in (L1);


Status  Predicate:  Example  2 

float function f1(float x,y,z);

{...};

assert (pre (p(a,b,c)), post (q(v,...))) in (L2);

L2: v=f1(a,b,c);

In this example, two status predicates (a precondition as well as a postcondition) are explicitly associated
with the single-point region identified by the label L2.

Now consider a slightly more general example by explicitly associating a postcondition, u, with the
function f1. This postcondition applies to all invocations of f1, and therefore specifically to its invocation
within the assignment at label L2:

float function f1(float x,y,z); assert (post (u(...)));

{...};

assert (pre (p(a,b,c)), post (q(v,...))) in (L2);

L2: v=f1(a,b,c);

Now the postcondition u(...) will be evaluated immediately after completion of the invocation f1(a,b,c).
Following that, after execution of the assignment to v, the postcondition for the call, q(v,...), will be evaluated.
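
The evaluation order described for this example (p before the statement, then u when f1 returns, then q after the assignment) can be mimicked in Python with a decorator. The concrete bodies chosen for f1, p, u, and q below are placeholders invented for illustration, as are the helper with_post and the trace list that records the order of checks.

```python
trace = []   # records the order in which predicates are evaluated

def with_post(post, label):
    """Attach a postcondition to every invocation of a function."""
    def wrap(fn):
        def inner(*args):
            result = fn(*args)
            trace.append(label)
            assert post(result), f"postcondition {label} violated"
            return result
        return inner
    return wrap

@with_post(lambda r: r >= 0.0, "u")   # function-level postcondition u(...)
def f1(x, y, z):
    return x * x + y * y + z * z      # placeholder body

def statement_L2(a, b, c):
    trace.append("p")
    assert a <= b <= c, "precondition p violated"   # placeholder p(a,b,c)
    v = f1(a, b, c)                                 # u is checked inside the call
    trace.append("q")
    assert v < 100.0, "postcondition q violated"    # placeholder q(v,...)
    return v
```

Calling statement_L2 once leaves the labels p, u, q in trace in that order, matching the sequence given in the text.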

Status Predicate: Example 3


assert (post (A(i) < B(i))) error (h3(i,A(i),B(i))) in (LL);

LL: CALL MPI_RECV(A(i), 1, MPI_REAL, myrank+1, ...)

Here a status predicate is associated explicitly with the statement identified by the label LL, which is an MPI
library call that receives the value of A(i). The expression A(i) < B(i) is a postcondition that must be
satisfied immediately after the execution of that statement. If it fails, the error handler h3(...)
will be activated.

Status  Predicate:  Example  4 
BB1:  region 

{  /*  basic  block  statement  sequence  */  }; 

assert (post ((v1 < upb) ∧ (v2 > lwb) ∧ (v2 < v1))) in (BB1);

Here, the assertion expression (v1 < upb) ∧ (v2 > lwb) ∧ (v2 < v1) serves as a postcondition for the
execution of the assertion region labeled by BB1.

Status  Predicate:  Example  5 

LOOP1:  while  cexpr 

{ 

/*  basic  block  statement  sequence  */ 
assert (post (x < y)) error (...);

} 

In  this  example,  the  postcondition  x  <  y  is  textually  associated  with  the  last  statement  of  the  while  loop’s 
body.  It  will  be  evaluated  every  time  the  body  of  the  while  loop  is  executed,  immediately  after  the  execution 
of  the  last  statement  in  the  body.  As  a  consequence,  this  expression  represents  an  invariant  that  must  be 
satisfied after each iteration of the loop, and after its termination. Note, however, that this predicate need not
be satisfied within the loop body.

Status  Predicate:  Example  6 

L1: x=g(y);

assert (pre (p1(x))) error (...) in (L1);

assert (pre (p2(y))) error (...) in (L1); /* ILLEGAL */

This is illegal, since no two preconditions can be associated with a single program region (see Section 4.4).

Invariant: Example 1


R:  region  assert  (x  <  y) 

{  while  cexpr 

{ 

/*  basic  block  statement  sequence  */ 

} 

} 

The difference between this example and Status Predicate Example 5 is that here the while loop has been
enclosed in an assertion region, R, and that the postcondition for the last statement has been replaced by an
invariant that needs to be satisfied throughout the region.

Invariant:  Example  2 


class cyclic_buffer
{ int n;
double B(n);
int c_cycles, p_cycles;


}

assert ((c_cycles < p_cycles) ∧ (p_cycles < c_cycles + n)) error (sync_error_handler(...));

The assertion in this example specifies an invariant contextually associated with the class cyclic_buffer. Its
lifetime is the union of lifetimes of objects that are instantiations of cyclic_buffer. The invariant expresses a
standard constraint for producer and consumer processes operating asynchronously on a cyclic buffer, B(n),
where n is the size of the buffer and c_cycles and p_cycles respectively represent the number of cycles the
consumer and producer have already executed.

Assume now that statement S below has updated c_cycles, as a consequence of consuming an item
from the buffer. Then the postcondition associated with S performs an explicit check of the invariant for
cyclic_buffer defined above:

S: new_item = consume_an_item(...) assert (post (check_predicate (cyclic_buffer))) error (h(...));

The basic error handler h(...) is activated if the postcondition for statement S is violated. This overrides
the error handler, sync_error_handler(...), provided for the invariant in the context of the class declaration.

Invariant:  Example  3 


int ncycles; assert (0 < ncycles < maxcycles);

Here we use a contextual association for the specification of an invariant for the variable ncycles that expresses
bounds for its values that must be respected throughout the lifetime of the variable. This could be
rewritten using an explicit association:

int  ncycles; 

assert (0 < ncycles < maxcycles) in (ncycles);

Below  we  modify  this  example  by  limiting  the  lifetime  of  the  invariant  to  the  time  program  execution 
resides  in  the  composite  regions  R1  and  R2: 

int  ncycles; 

assert (0 < ncycles < maxcycles) in (ncycles) restricted_to (R1,R2);


Invariant:  Example  4 


int  ncycles,  mcycles  assert  (0  <  allvars  <  maxcycles ); 

This  is  similar  to  the  previous  example,  except  for  the  fact  that  the  invariant  is  now  valid  for  both  of  the 
variables  mentioned  in  the  declaration,  i.e.,  ncycles  as  well  as  mcycles,  due  to  the  use  of  the  quantifier 
allvars . 

Invariant:  Example  5 


int  i,  j,  k; 

int *ip; assert ((range(ip) = (&i, &k)) ∧ (lwb < *ip < upb))

The integer pointer variable ip is restricted to point to the variables i and k; in addition, the values of the
variables pointed to by ip must always lie between the bounds lwb and upb.

Invariant:  Example  6 


float function f1(float x,y,z);
error (numerical → hw1(...);
SECDED → hw2(...);
else hw3(...))
repeat (k);
assert (w(...));

assert (post (u(...))) error (hu(...));
{...};

assert (pre (p(a,b,c)), post (q(v,...))) in (L2);


L2: v=f1(a,b,c);

This is an extension of the second version of Status Predicate Example 2. Everything that has been said
there is still valid. In addition, an error handler and an invariant, w(...), have been associated with the
function. This means that throughout any invocation of the function, w(...) needs to be satisfied. If, during
an invocation, an error of class numerical or SECDED occurs, the error handlers hw1(...) or hw2(...) are
respectively activated. If an error of any other class occurs, error handler hw3(...) is activated. If such a
recovery action is unsuccessful, the function activation is repeated up to k > 1 times (see Section 3).

5  Tolerance  Directives 

5.1  Syntax 

1. tolerance-directive → tolerance-clause [tolerance-target] [standard-error-recovery]

2. tolerance-clause → variable-tolerance-clause | global-tolerance-constraint

3. variable-tolerance-clause → error-tolerance-clause | tolerant-memory-clause

4. error-tolerance-clause → “tolerate (” tolerance-expression “)”

5. tolerance-expression → boolean-violation-permit-expression

6. violation-permit → “(” threshold error-class “)”

7. threshold → integer-expression | “accumulated (” integer-expression “)” | “access_percentage (” number-of-variable-accesses,
error-percentage “)” | all

8. number-of-variable-accesses → integer-expression

9. error-percentage → floating-point-expression

10. tolerant-memory-clause → “tolerant_memory (” integer-expression “)”

11. tolerance-target → variable-tolerance-target | global-constraint-tolerance-target

12. variable-tolerance-target → aggregate-data-region

13. global-constraint-tolerance-target → aggregate-data-region | aggregate-program-region

14. global-tolerance-constraint → “constrain_tolerance (” boolean-expression “)”

5.2 Overview


A  tolerance  directive  instructs  the  system  to  ignore  certain  errors  that  occur  during  the  manipulation  of 
data.  Its  specification  contains  up  to  three  components:  (1)  a  mandatory  tolerance  clause,  (2)  an  optional 
tolerance  target,  and  (3)  an  optional  standard  error  recovery. 

A  tolerance  clause  is  either  a  variable  tolerance  clause  or  a  global  tolerance  constraint.  The  target  of  a 
variable  tolerance  clause  is  an  aggregate  data  region,  and  the  tolerance  clause  in  this  case  expresses  the  error 
permissions  for  the  data  objects  belonging  to  that  region.  A  global  tolerance  constraint  can  be  associated 
with  an  aggregate  data  region  or  an  aggregate  program  region;  it  specifies  a  constraint  over  a  set  of  variable 
tolerance  clauses  in  that  region. 

If, during execution in its target region, the condition expressed in a tolerance clause is violated, an error
of type tolerance_violation is raised. An explicit check for the validity of a tolerance clause can be expressed
via the built-in function check_tolerance (v) (see Section 4.5).

5.3  Variable  Tolerance  Clauses 

We  distinguish  two  different  types  of  variable  tolerance  clauses:  first,  an  error  tolerance  clause  lists  specific 
error  classes  that  should  be  tolerated,  together  with  a  threshold  limiting  the  number  of  permitted  violations. 
Secondly,  a  tolerant  memory  clause  directs  the  compiler  to  allocate  “tolerant  memory”  for  the  associated 
data  objects.  It  results  in  the  suppression  of  a  limited  number  of  memory  errors. 

5.3.1  Error  Tolerance  Clause 

An error tolerance clause consists of the keyword tolerate, followed by a parenthesized tolerance expression,
which is a boolean expression formulating a logical condition over violation permits. Each violation
permit consists of an error class together with a threshold; it determines a boolean value as explained below.

Assume that the error tolerance clause under consideration is associated with variables v_1, ..., v_n, where n ≥ 1,
and (t, c) is a violation permit, where t is a threshold and c an error class. Depending on t, the meaning of
the violation permit (t, c) is defined as follows:

1. t is an integer expression

In this case, the value of t must be ≥ 0, and the violation permit is satisfied iff for each of the variables
v_i, 1 ≤ i ≤ n, there are at most t occurrences of error class c. If t = 0, then the effect of the violation
permit is the same as if it were absent, i.e., any occurrence of an error of class c in any of the variables
results in an error.

2. t = accumulated (t')

Here, t' is an integer expression with a value > 0. The violation permit is satisfied iff the number of
occurrences of error class c, accumulated over all of the variables v_i, 1 ≤ i ≤ n, is at most t'.

3. t = access_percentage (k, r)

Here, k is an integer expression with a (usually large) positive value, and r is an expression yielding
a floating point number in the range 0 < r < 100. The violation permit is satisfied iff for each of the
variables v_i, 1 ≤ i ≤ n, and for any k successive accesses to that variable, at most rk/100 errors of
class c occur.

4. t = all

In this case, there is no limit to occurrences of error class c in any of the v_i, 1 ≤ i ≤ n: the violation
permit is always satisfied, independent of how many errors of that type occur.
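
The four threshold forms can be compared side by side in a small Python sketch. It is a model for illustration only: the function name permit_satisfied, the tuple encoding of thresholds, and the per-variable access log (one boolean per access, True where an error of class c occurred) are assumptions of this sketch rather than AL constructs.

```python
def permit_satisfied(threshold, error_log, variables):
    """Evaluate one violation permit (t, c) for a fixed error class c.

    error_log[v] is a list of booleans, one per access to variable v.
    threshold is an int, ("accumulated", t), ("access_percentage", k, r),
    or the string "all".
    """
    counts = {v: sum(error_log.get(v, [])) for v in variables}
    if threshold == "all":
        return True                                  # no limit at all
    if isinstance(threshold, int):
        # per-variable limit: each variable may see at most t errors
        return all(n <= threshold for n in counts.values())
    kind = threshold[0]
    if kind == "accumulated":
        # one shared budget across all variables
        return sum(counts.values()) <= threshold[1]
    if kind == "access_percentage":
        _, k, r = threshold
        limit = r * k / 100
        for v in variables:
            log = error_log.get(v, [])
            # slide a window over every run of k successive accesses
            for start in range(max(1, len(log) - k + 1)):
                if sum(log[start:start + k]) > limit:
                    return False
        return True
    raise ValueError(f"unknown threshold: {threshold!r}")
```

For instance, with two errors logged for A and one for B, an integer threshold of 2 is satisfied while ("accumulated", 2) is not, since the shared budget counts all three errors.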

5.3.2  Examples  for  the  Error  Tolerance  Clause 
Error  Tolerance  Clause:  Example  1 


float A, B, C; tolerate ((n1, Inf) ∧ (n2, NaN));

float D; tolerate ((n3, SECDED));

float E, F; tolerate ((accumulated (n4), ANY));

float *ip1; tolerate ((access_percentage (1000,2), SECDED));

float G; tolerate ((all, ANY));

Here, let n1, n2, n3, and n4 denote integer expressions, all yielding values > 0. The error tolerance clauses
specified above have the following meaning:

• During the manipulation of the floating point variables A, B, and C, n1 Inf and n2 NaN errors are
tolerated for each of these variables. No other errors are tolerated.

• The manipulation of the floating point variable D permits n3 SECDED errors. No other errors are
tolerated.

• The manipulation of variables E and F tolerates an accumulation of at most n4 arbitrary errors in both
variables together, i.e., the number of errors occurring during the manipulation of E plus the number
of errors occurring during the manipulation of F must not exceed n4.

• The manipulation of the floating point values of variables pointed to by ip1 tolerates at most 2%
SECDED errors for any 1000 successive accesses to these variables.

• The tolerance for variable G is the most encompassing: all errors of any type that occur during
manipulation of G are tolerated.

Error  Tolerance  Clause:  Example  2 


float  A,  B,  C; 
float  D; 

tolerate ((n1, Inf) ∧ (n2, NaN)) in (A, B, C);
tolerate ((n3, SECDED)) in (D);

The semantics of this example are the same as for the first two lines in the previous one. The difference is
in the syntax, which allows for the separation of declarations and the associated error tolerance clauses.

Error Tolerance Clause: Example 3


float  H; 

tolerate ((all, Inf)) in (H) restricted_to (f1);
float function f1(...) { /* function body */ };

Here, the tolerance directive associated with variable H is imposed during program execution in each invocation
of function f1, and is not in effect elsewhere. All occurrences of Inf errors are tolerated.

Error  Tolerance  Clause:  Example  4  -  Numerical  Representation  Errors 

float J tolerate ((n1, mantissa [7:0]));
int K tolerate ((n2, bits [31:16]));

Let n1 and n2 denote integer expressions with values > 0. The manipulation of the floating point variable
J tolerates n1 errors in the least significant 8 bits of the mantissa. No other errors are permitted.

The manipulation of the integer variable K tolerates n2 errors in the upper 2 bytes of its representation.
No other errors are permitted.

Error  Tolerance  Clause:  Example  5  -  Standard  Error  Recovery  in  a  Data  Region 

float A tolerate ((n1, NaN) ∧ (n2, Inf) ∧ (n3, SECDED))
error (NaN → NaN_violation_handler(n1,...);

Inf → Inf_violation_handler(n2,...);

SECDED → secded_error_handler(n3,...);
else other_handler(error-class,...));

Here, let n1, n2, and n3 denote integer expressions, all yielding values > 0. The tolerance directive specified
above has the following meaning: During the manipulation of the floating point variable A, n1 NaN arithmetic
errors, n2 Inf arithmetic errors, and n3 SECDED errors are tolerated. No other errors are tolerated.

Two types of errors can occur in this setting: first, there may be errors that are not listed in the
violation permit, i.e., errors different from NaN, Inf, and SECDED. These are called non-tolerated errors.
Secondly, more NaN, Inf, and/or SECDED errors may occur than are tolerated by the directive. Error
handling is able to distinguish these cases and treat them separately as follows:

• The occurrence of more than n1 NaN errors is handled by the NaN_violation_handler.

• The occurrence of more than n2 Inf errors is handled by the Inf_violation_handler.

• The occurrence of more than n3 SECDED errors is handled by the secded_error_handler.

• The occurrence of any other error results in the activation of error handler
other_handler(error-class,...).

Note that the semantics of the error clause is slightly different from that discussed in Section 3 in that a
tolerated error is treated in the same way as no error.
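
The dispatch behavior described in this example can be sketched in Python. The factory make_tolerance and its dictionary-based interface are hypothetical; the sketch only shows the three outcomes: a tolerated error behaves like no error, an exceeded threshold activates the handler for that class, and any unlisted class goes to the fallback handler.

```python
def make_tolerance(limits, handlers, other_handler):
    """limits: error class -> tolerated count; handlers: class -> handler."""
    seen = {c: 0 for c in limits}   # occurrences observed so far per class

    def on_error(error_class):
        if error_class not in limits:
            return other_handler(error_class)      # non-tolerated error
        seen[error_class] += 1
        if seen[error_class] > limits[error_class]:
            # threshold exceeded: activate the class-specific handler
            return handlers[error_class](limits[error_class])
        return None                                # tolerated: same as no error
    return on_error
```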

Error Tolerance Clause: Example 6 - Multiple Error Handlers for Multiple Variables


float D,E; tolerate ((accumulated (n4), numerical_error))

error (numerical_error → numerical_error_handler(accumulated,n4,D,nD,E,nE,...);
else other_handler(error-class,...));

Let n4 denote an integer expression yielding a value > 0. The tolerance directive has the following meaning:

The manipulation of the variables D and E tolerates an accumulation of at most n4 numerical
errors in both variables together, i.e., the number of such errors occurring during the manipulation of D
plus the number of errors occurring during the manipulation of E must not exceed n4. If this
limit is violated, the numerical error handler is activated. Its arguments include the limit, n4, and each
affected variable (D, E) together with the number of error occurrences caused by that variable (nD and
nE respectively). As before, the occurrence of any other, i.e., non-tolerated, error triggers the activation of
other_handler.

5.3.3  Tolerant  Memory  Clause 

A tolerant memory clause consists of the keyword tolerant_memory followed by an integer expression,
which must yield a positive value, t.

This directive specifies that memory for the associated data objects is to be allocated in an error-tolerant
memory region. Such a region is subject to a special set of rules: the operating system keeps track of
uncorrectable errors during execution in this region, but takes no further action as long as the number of
errors does not exceed t.

The allocation of data objects in an error-tolerant memory region can be performed in the C programming
language by calling a special version of the malloc routine.

5.3.4  Example  for  Tolerant  Memory  Clause 


atype A; tolerant_memory (t)

A = tolerant_malloc(sizeof(atype));

Variable A is declared of type atype. The associated directive specifies that A is to be allocated in an
error-tolerant memory region, with a threshold of at most t errors to be tolerated. The call to the special routine
tolerant_malloc performs the requested allocation.
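
The bookkeeping implied by the threshold t can be pictured with a toy Python model. The class TolerantRegion and its method are hypothetical and do not correspond to any real operating system interface; the sketch only captures the rule that uncorrectable errors are counted and tolerated until their number exceeds t.

```python
class TolerantRegion:
    """Toy model of an error-tolerant memory region with threshold t."""

    def __init__(self, t):
        if t <= 0:
            raise ValueError("t must be positive")
        self.t = t
        self.uncorrectable = 0   # uncorrectable errors seen so far

    def report_uncorrectable(self):
        """Record one uncorrectable error; return True while still tolerated."""
        self.uncorrectable += 1
        return self.uncorrectable <= self.t
```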

5.4  Global  Tolerance  Directives 

The  features  provided  by  variable  tolerance  directives  do  not  offer  an  easy  way  to  specify  a  logical  condition 
involving  multiple  such  directives.  This  can  be  achieved  via  a  global  tolerance  constraint.  We  illustrate 
the  use  of  such  a  constraint  with  an  example: 

HPC_Sparse_Matrix  *  A; 

float  *  x,  *y; 

tolerate ((1,SECDED)) in (*A, *x, *y);

constrain_tolerance (*A xor *x xor *y)

In this example, an identical variable tolerance clause is specified for the values pointed to by A, x, and
y. The global tolerance constraint imposes an additional restriction in that it allows a SECDED error
to occur for at most one of these variables (xor represents the exclusive or operator).
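
A minimal Python sketch of this constraint follows. It implements the reading given in the prose ("at most one of these variables"); note that a literal three-way exclusive or would also hold when all three variables are affected, so the sketch is an interpretation. The function name and the dictionary of per-variable SECDED counts are assumptions.

```python
def constraint_holds(secded_counts):
    """True iff a SECDED error occurred for at most one of the variables.

    secded_counts maps a variable name to its number of SECDED errors.
    """
    affected = [name for name, n in secded_counts.items() if n > 0]
    return len(affected) <= 1
```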

6  Pragmas 

6.1  Syntax 

1. pragma → pragma-clause [pragma-target] [standard-error-recovery]

2. pragma-clause → voltage-clause | other-pragma-clause

3. voltage-clause → “voltage_options (” voltage-variable probability-specification-list “)”

4. probability-specification → undetected-error-probability-specification | detected-error-probability-specification
| any-error-probability-specification

5. undetected-error-probability-specification → “p_undetected_errors =” expression;

6. detected-error-probability-specification → “p_detected_errors =” expression;

7. any-error-probability-specification → “p_errors =” expression;

8. pragma-target → aggregate-data-region | aggregate-program-region

6.2  Overview 

The  main  purpose  for  introducing  pragmas  in  this  document  is  as  a  placeholder  for  future  extensions  of  the 
language.  In  this  version  of  AL,  our  only  focus  is  on  a  special  class  of  pragmas  that  support  control  of  the 
tradeoff  between  energy  and  reliability. 

The  connection  between  energy  efficiency  and  reliability  can  be  exploited  at  a  hierarchy  of  architecture 
and  software  levels.  In  essentially  every  computer  in  production  today,  a  logic  circuit  can  be  designed  for 
the  control  of  power  consumption  and  measured  for  error  rate.  Managing  this  tradeoff  should  be  possible 
when  new  devices  intended  to  extend  Moore’s  Law  are  perfected.  The  power  supply  to  these  devices  could 
be  generated  by  a  Digital  to  Analog  Converter  (DAC).  By  changing  the  power  supply  voltage,  the  speed, 
power,  and  error  rate  of  devices  could  be  changed  in  nanoseconds.  It  would  be  possible  to  build  a  computer 
where  such  changes  could  be  applied  to  composite  program  regions  under  program  control  thus  leading  to 
algorithms  that  would  be  able  to  control  the  tradeoff  between  reliability,  energy  efficiency,  and  performance, 
and  to  more  closely  approach  energy  efficiency  limits. 

Pragmas  are  specified  using  a  structure  similar  to  that  used  for  assertions  and  tolerance  directives,  as 
discussed  in  Sections  4  and  5.  Their  specification  consists  of  up  to  three  components:  (1)  a  mandatory 
pragma  clause,  (2)  an  optional  pragma  target,  and  (3)  an  optional  standard  error  recovery.  In  general,  a 
pragma  target  can  be  either  an  aggregate  data  region  or  an  aggregate  program  region.  In  both  cases,  a 
standard  error  recovery  can  be  combined  with  the  construct. 
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6.3  Voltage  Pragmas 

Voltage  pragmas  arc  associated  with  aggregate  program  regions.  They  provide  information  about  the  rela¬ 
tionship  between  the  voltage  at  which  a  composite  region  is  executed  and  the  resulting  probability  of  error 
occurrences  during  that  execution.  Based  on  this  information  the  compiler  and/or  runtime  system  can  select 
a  mode  of  operation  that  depends  on  the  error-handling  capabilities  specified  for  this  region,  and  on  the 
environment  in  which  it  is  executed.  For  example,  in  the  presence  of  an  elaborate  error-recovery  capability 
associated  with  a  function,  the  system  may  decide  to  execute  the  function  at  a  low  voltage,  saving  energy  but 
accepting  the  risk  of  a  high  number  of  error  occurrences.  On  the  other  hand,  if  an  inadequate  error  response 
has been provided for the function, the system may choose to execute it at a voltage that minimizes the
risk  of  voltage-induced  error. 

In  general,  a  voltage  clause  has  the  form 

voltage_options (v, probability_specification_list)

where v identifies a voltage variable, and the probability_specification_list may contain the following
elements:

1.  An  undetected  error  probability  specification  of  the  form: 

p_undetected_errors = pu(v)

where pu(v) is an expression specifying the probability for undetected errors resulting from an
execution at voltage v.

2.  A  detected  error  probability  specification  of  the  form: 

p_detected_errors = pd(v)

where  pd(v)  is  an  expression  specifying  the  probability  for  detected  errors  resulting  from  an  execution 
at  voltage  v. 

3.  An  any  error  probability  specification  of  the  form: 

p_errors = pe(v)

where  pe(v)  is  an  expression  specifying  the  probability  for  any  errors  resulting  from  an  execution  at 
voltage  v. 

The probability specification list may consist of exactly one of the above three options, or of a
combination of the first two options.
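
The combination rule above can be checked mechanically. The following is a minimal Python sketch, not part of the AL specification; the function name is_valid_spec_list and the representation of a specification list as a list of clause names are assumptions made purely for illustration.

```python
# Clause names taken from the three specification forms listed above.
ALLOWED = {"p_undetected_errors", "p_detected_errors", "p_errors"}


def is_valid_spec_list(names):
    """Return True if the clause names form a legal probability specification list.

    Legal lists (per the rule above) are exactly one of the three options,
    or the combination of the undetected and detected options.
    Hypothetical helper: the AL specification does not define such a function.
    """
    names = list(names)
    unique = set(names)
    if not unique or len(unique) != len(names):
        return False  # empty list or duplicated clause
    if not unique <= ALLOWED:
        return False  # unknown clause name
    if len(unique) == 1:
        return True
    return unique == {"p_undetected_errors", "p_detected_errors"}
```

Under this reading, a list that pairs p_errors with either of the other two clauses is rejected, since the any-error probability already subsumes both.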
6.4 Examples for Voltage Pragmas

Voltage  Pragmas:  Example  1 


float function f(float x, ...)
    voltage_options (v, p_undetected_errors = p1(v),
                        p_detected_errors = p2(v));
    error (c1 -> h1(...), ..., cn -> hn(...));

{...};

In this example, the pragma voltage_options (...) is contextually associated with the declaration of function
f. The probabilities for undetected as well as detected errors are respectively specified as p1(v) and p2(v)
depending on the voltage, v, chosen for an execution of the function. Furthermore, error handlers h1, ..., hn
for error classes c1, ..., cn are associated with the function.

Voltage  Pragmas:  Example  2 

Here  we  consider  a  situation  where  a  function  is  embedded  in  a  hierarchical  program  structure  and  may  be 
executed  at  different  voltages  depending  on  circumstances  in  its  environment.  Function  arguments  in  this 
example are respectively characterized as input, input/output, or output by the qualifiers in, inout, and out.

float function g(float in x, ...)
    voltage_options (v, p_undetected_errors = p3(v),
                        p_detected_errors = p4(v));

{...};

Function g has a floating point input argument, x, and delivers a value of type float. The voltage pragma
provided  for  the  function  specifies  respective  error  probabilities  p3(v)  and  p4(v)  for  undetected  and  detected 
errors,  depending  on  the  operating  voltage  at  which  an  invocation  of  g  is  executed. 

Assume now that g1 is a function with input/output argument x that calls g, but does not have any
error-handling capabilities associated with its definition:

function g1(float inout x) {x=g(x)};

g1(A);


The invocation g1(A) leads to a call of g(A), with its result assigned to variable A if the execution is
successful. However, the value of A may be corrupted in the case of an error occurring during the execution
of g(A), possibly in addition to the corruption of other data. A compiler or runtime analysis of the
environment in which g1(A) is called may determine that such an error is unrecoverable. In this case, the system
may  decide  to  execute  this  call  at  a  high  voltage  to  meet  reliability  requirements. 

On the other hand, let function g2 be defined as follows:

function g2(float out x, in y) error (h1(...)) repeat (k, finally (h2(...))) save(B,C,D);

{x=g(y)}; 

t=A; 

g2(A,t); 


In contrast to g1 above, the definition of function g2 with output argument x and input argument y has
associated with it an explicit error handling specification: when an error occurs, error handler h1(...)
is called first, and the function is executed repeatedly up to k times if h1(...) cannot successfully handle the
error. If, after k repeated executions, errors still occur, error handler h2(...) is activated. The save directive
directs  the  system  to  save  the  values  of  variables  B,C,D  before  the  initial  invocation  of  g2(A,  t);  the 
original  values  of  these  variables  are  restored  before  each  repeated  execution  caused  by  an  error  occurrence. 
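
The recovery semantics just described (call h1 first, retry up to k times with the saved variables restored before each retry, then fall back to h2) can be sketched in Python. This is only an illustration of the control flow under stated assumptions, not the AL implementation: RegionError, run_with_recovery, the dictionary used as program state, and the convention that h1 returns True when it has handled the error are all hypothetical.

```python
import copy


class RegionError(Exception):
    """Hypothetical stand-in for an error detected while executing a region."""


def run_with_recovery(body, state, saved_names, h1, h2, k):
    """Run body(state) with error(h1) repeat(k, finally(h2)) save(...) semantics.

    state is a dict of program variables; saved_names lists the variables
    named in the save directive. Returns a status string for illustration.
    """
    # save(...): record the values before the initial invocation.
    snapshot = {name: copy.deepcopy(state[name]) for name in saved_names}
    try:
        body(state)
        return "ok"
    except RegionError as err:
        if h1(err, state):  # assumed convention: True means "handled"
            return "handled_by_h1"
    for _ in range(k):
        # Restore the saved values before each repeated execution.
        for name, value in snapshot.items():
            state[name] = copy.deepcopy(value)
        try:
            body(state)
            return "ok_after_retry"
        except RegionError:
            continue
    h2(state)  # finally(h2): all k repeated executions failed
    return "handled_by_h2"


# Usage: a body that fails twice, modifying B each time, then succeeds.
attempts = {"n": 0}


def flaky_body(state):
    attempts["n"] += 1
    state["B"] += 1  # side effect that must be undone before a retry
    if attempts["n"] < 3:
        raise RegionError("transient fault")
    state["A"] = 2 * state["B"]


program_state = {"A": 0, "B": 10}
status = run_with_recovery(flaky_body, program_state, ["B"],
                           h1=lambda err, s: False, h2=lambda s: None, k=3)
```

Because B is restored before every retry, the successful execution sees the original value of B rather than one corrupted by the failed attempts.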

The  system  may  at  the  beginning  choose  to  execute  g2  at  a  low  voltage,  but  increase  the  voltage  for 
repeated  executions.  However,  lowering  the  voltage  initially  too  far  could  cause  so  many  retries  that  the 
total  power  consumption  actually  increases.  This  could  lead  to  a  system  strategy  for  optimizing  the  voltage 
initially  chosen  for  minimum  energy  consumption. 
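
To make this tradeoff concrete, the following Python sketch evaluates a toy energy model. Everything numeric here is an assumption for illustration only: energy per execution is taken to scale with the square of the voltage (the usual first-order model for dynamic switching energy), the error probability p(v) is an invented exponential curve, and a failed first attempt is assumed to be followed by exactly one successful retry at the nominal voltage V_NOMINAL. None of these choices comes from the AL specification.

```python
import math

V_NOMINAL = 1.0  # assumed nominal (safe) voltage, arbitrary units
V_MIN = 0.5      # assumed lowest voltage the hardware can supply


def error_probability(v):
    """Invented model: errors become certain at V_MIN and rare near V_NOMINAL."""
    return min(1.0, math.exp(-8.0 * (v - V_MIN)))


def expected_energy(v):
    """Expected energy of one call: first attempt at v, retry at V_NOMINAL on error."""
    first_attempt = v ** 2
    retry = error_probability(v) * V_NOMINAL ** 2
    return first_attempt + retry


# Scan candidate initial voltages on a 0.01 grid and pick the cheapest one.
candidates = [step / 100 for step in range(50, 101)]
best_voltage = min(candidates, key=expected_energy)
```

With this model the minimum lies strictly between V_MIN and V_NOMINAL: running at the lowest voltage pays for a retry almost every time, while running at the nominal voltage forgoes the savings of the cheap first attempt.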

7  Conclusion 

We  conclude  this  document  by  discussing  a  number  of  ideas  on  extending  the  present  specification  of  the 
assertion  language  in  Section  7.1,  followed  by  addressing  some  inherent  semantic  limitations  of  the  language 
in  Section  7.2. 

7.1  Potential  Future  Extensions  of  the  Assertion  Language 

There are some obvious areas where the present AL specification can be refined and/or generalized without
affecting  its  basic  structure  and  concepts.  They  include: 

•  Dynamic  management  of  assertion  language  constructs 

•  Extension  of  tolerance  directives  by  approximate  types 

•  Extended  support  for  error  recovery 

•  Introduction  of  additional  pragma  categories 

•  Definition  and  management  of  user-defined  error  classes 

•  More  precise  classifications  of  data  and  program  regions 

It  is  expected  that  such  changes  will  be  implemented  as  a  result  of  receiving  feedback  from  compilation 
of the language and from its use in critical applications. The first two of these potential extensions will
be  respectively  discussed  in  Sections  7.1.1  and  7.1.2.  A  much  further-reaching  extension — the  introduction 
of  support  for  parallel  computations — will  be  outlined  in  Section  7.1.3. 
7.1.1 Dynamic Management of Assertion Language Constructs

The discussion in this document assumed that an assertion language construct appearing in a source program
was  always  active.  However,  compiler  and/or  runtime  analysis  may  result  in  the  desirability  to  dynamically 
manage  such  constructs,  e.g.,  by  turning  them  off  and  on  depending  on  the  results  of  analysis.  Knobs,  as 
proposed by Bernholdt et al., generalize such boolean guards [2]. They are mechanisms that can control the
cost,  in  terms  of  performance  and  energy,  associated  with  the  execution  of  assertion  language  constructs. 
For  example,  they  allow  flexible  control  of  the  strength  of  error  detectors  such  as  checksum  procedures  and 
the  frequency  of  their  activation. 

7.1.2  Approximate  Types 

The  semantics  of  most  common  programming  languages  specifies  in  detail  the  mathematical  properties  to  be 
obeyed  by  the  representation  and  processing  of  numerical  types.  Tolerance  directives,  as  defined  in  Section  5 
of  this  document,  support  some  relaxation  of  these  strict  rules.  A  different  approach  for  the  manipulation 
of  numerical  values  could  follow  the  concept  of  approximate  types  as  proposed  in  the  EnerJ  language  [11]. 
Approximate types provide flexibility for value representation and operation implementation, with
implementation details remaining system and application dependent. The underlying idea is to suppress certain
numerical errors, in addition to providing the implementation with the opportunity for achieving potentially
higher performance and reduced energy consumption by exploiting the weakened requirements for
representations and operations. The introduction of approximate types leads to a separation of computations that
must be executed in a mathematically precise fashion from approximate computations. If approximate type
computations are mixed with precise type computations, care must be taken not to destroy the hard
correctness of precise type computations. Specifically, the assignment of the value of an approximate type variable
to  a  precise-type  variable  needs  special  consideration. 
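
The separation between precise and approximate computations can be illustrated with a small Python sketch. It is loosely modeled on the idea of explicit endorsement in EnerJ but is not EnerJ's API and not part of AL; the names Approx, endorse, and precise_sqrt are hypothetical, and the sketch only tracks which values are approximate rather than actually relaxing their representation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Approx:
    """Hypothetical wrapper marking a numerical value as approximate."""
    value: float

    def __add__(self, other):
        # Mixing approximate and precise operands yields an approximate result.
        other_value = other.value if isinstance(other, Approx) else other
        return Approx(self.value + other_value)

    __radd__ = __add__


def endorse(a):
    """Explicitly convert an approximate value to a precise one.

    This is the point that needs special consideration: the programmer
    asserts that the value is good enough for precise use.
    """
    return a.value


def precise_sqrt(x):
    """A computation that requires hard correctness and rejects Approx inputs."""
    if isinstance(x, Approx):
        raise TypeError("approximate value used in a precise computation; "
                        "call endorse() first")
    return x ** 0.5
```

For example, precise_sqrt(Approx(4.0) + 5.0) raises TypeError, whereas precise_sqrt(endorse(Approx(4.0) + 5.0)) is accepted, which makes every crossing from approximate to precise visible in the source.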

7.1.3  Assertion  Language  Support  for  Parallelism 

The current AL specification is based upon a sequential programming model. In view of the massive
parallelism characterizing exascale computations, the scope of AL's applicability could be significantly enhanced
by introducing support for parallel computations. For example, new AL constructs could support fault
tolerance for synchronous and asynchronous parallelism such as forall loops, parallel thread scheduling, and
associated  synchronization  features.  Error  control  would  have  to  be  generalized  to  deal  with  errors  in  these 
new  constructs,  and  at  the  same  time  include  parallelism  in  its  recovery  routines.  Predicates  could  take 
into account the distribution of large data structures across different memories, addressing related
consistency issues and the resulting balance of computations. Tolerance directives could address specific errors in
parallel  computations  that  could  be  ignored  without  affecting  the  (soft)  correctness  of  computations.  And 
pragmas  could  be  generalized  to  control  the  degree  of  parallelism  of  a  construct  depending  on  the  tradeoff 
between  gain  in  computational  performance  and  the  overhead  of  synchronization  and  communication.  Such 
extensions would amount to a major generalization of AL, which we believe could be made without a
breakdown  of  the  basic  structure  of  AL  as  specified  at  this  time. 

7.2  On  the  Semantic  Limits  of  the  Assertion  Language  and  Its  Implementation 

Assertions provide the means to formulate an error detector along the lines of Huang-Abraham's
Algorithm-Based Fault Tolerance (ABFT) approach [5]. For example, a call to a checksum computation can be
expressed with a status predicate attached to a single-point region. However, AL is not designed to handle
error  detection  capabilities  that  exploit  deep  semantic  properties  of  the  algorithm  and  its  data  structures. 
Such  features  must  be  explicitly  included  by  the  programmer  in  the  algorithm  and  the  related  data  structure 
declarations. 

We  illustrate  these  relationships  with  an  example  for  a  code  section,  which  performs  a  fault-tolerant 
computation of the product y = A * x for matrix A and vector x:

/* Original data structures are dimensioned A(1:m, 1:n), x(1:n), y(1:m) */

/* Declaration of extended data structures: */

float AA(1:m+1, 1:n);  /* Row AA(m+1, 1:n) is added to the original data structure */
float xx(1:n+1);       /* Element xx(n+1) is added to the original input vector */
float yy(1:m+1);       /* Element yy(m+1) is added to the original result vector */

/* Assume AA(i,j) to be defined for 1 <= i <= m, 1 <= j <= n, and xx(j) for 1 <= j <= n */

/* AA(m+1, j) is now computed as AA(m+1, j) = Σ_{i=1:m}(A(i,j)) for all j, 1 <= j <= n */

/* xx(n+1) is computed as xx(n+1) = Σ_{j=1:n}(x(j)) */

/* compute the matrix-vector product: yy(1:m) := AA(1:m, 1:n) * xx(1:n) */

/* compute yy(m+1) := AA(m+1, 1:n) * xx(1:n) */

assert (post(yy(m+1) = Σ_{i=1:m}(yy(i)))) error (checksum_error_handler(AA, xx, yy));


In the case that the postcondition for the last statement fails, the checksum_error_handler(...) will be
activated  and  given  the  task  to  perform  recovery  from  that  error.  However,  we  do  not  assume  that  the  system 
in  which  the  implementation  of  AL  is  embedded  has  sufficient  knowledge  about  the  algorithmic  and  data 
structure  semantics  to  deal  with  this  problem  successfully  on  its  own.  That  is,  any  information  required  for 
the  analysis  of  and  recovery  from  such  an  error  must  be  explicitly  provided  in  the  error  handler. 
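
For readers who want to experiment with the checksum scheme above, here is a minimal runnable Python sketch of the same computation. It is an illustration under stated assumptions rather than the AL implementation: the function name checksum_matvec and the inject_fault hook are hypothetical, indices are 0-based, the input-vector checksum xx(n+1) is omitted because the postcondition does not use it, and the exact equality of the postcondition is replaced by a relative tolerance tol to allow for floating-point rounding.

```python
def checksum_matvec(A, x, inject_fault=None, tol=1e-9):
    """Compute y = A * x with a column-checksum row and verify the result.

    A is a list of m rows of length n, x a list of length n.
    Returns (y, passed), where passed is the outcome of the postcondition
    yy[m] == sum(yy[0:m]), evaluated up to the relative tolerance tol.
    inject_fault, if given, may corrupt yy in place to simulate an error.
    """
    m, n = len(A), len(x)
    # Extended matrix: the extra row holds the column sums of A.
    checksum_row = [sum(A[i][j] for i in range(m)) for j in range(n)]
    AA = [list(row) for row in A] + [checksum_row]
    # Extended product: yy[0:m] is the result, yy[m] is its checksum.
    yy = [sum(AA[i][j] * x[j] for j in range(n)) for i in range(m + 1)]
    if inject_fault is not None:
        inject_fault(yy)
    expected = sum(yy[:m])
    passed = abs(yy[m] - expected) <= tol * max(1.0, abs(expected))
    return yy[:m], passed
```

When passed is False, the caller plays the role of the error handler: as noted above, the check only reports that the result is inconsistent, and deciding how to locate and repair the fault requires knowledge that must be supplied explicitly.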
Figure 1: FailSafe System Overview